Estimation of large block covariance matrices: Application to the analysis of gene expression data
Motivated by an application in molecular biology, we propose a novel, efficient and fully data-driven approach for estimating large block structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples without limiting ourselves to block diagonal matrices. Our approach consists in approximating such a covariance matrix by the sum of a low-rank sparse matrix and a diagonal matrix. Our methodology can also deal with matrices for which the block structure only appears if the columns and rows are permuted according to an unknown permutation. Our technique is implemented in the R package BlockCov which is available from the Comprehensive R Archive Network and from GitHub. In order to illustrate the statistical and numerical performance of our package some numerical experiments are provided as well as a thorough comparison with alternative methods. Finally, our approach is applied to gene expression data in order to better understand the toxicity of acetaminophen on the liver of rats.
Driven by a wide range of applications ranging from economics to biology, the estimation of large covariance matrices has drawn considerable recent attention, see for instance . In the high-dimensional setting where the dimension of the covariance matrix is much larger than the sample size, it is well known that the commonly used sample covariance matrix perform poorly. In recent years, researchers have proposed various regularization techniques to consistently estimate large covariance matrices. To estimate such matrices, one of the key assumptions made in the literature is that the matrix of interest is sparse, namely many entries are equal to zero. A number of regularization approaches including banding, tapering, thresholding and minimization, have been developed to estimate large covariance matrice or their inverse. See, for instance, , ,  among many others. For further references, we refer the reader to  and to the review of .
In this paper, we shall consider the following framework. Let , zero-mean i.i.d. -dimensional random vectors having a covariance matrix such that the number of its rows and columns is much larger than . The goal of the paper is to propose a new estimator of and of the square root of its inverse, in the particular case where is assumed to have a block structure without limiting ourselves to diagonal blocks. More precisely, in this paper, we shall assume that
where is a sparse matrix with , denotes the transpose of the matrix and is a diagonal matrix such that the diagonal terms of are equal to one. Two examples of such matrices and are given in Figure 1 in the case where and and in the case where the columns of do not need to be permuted in order to see the block structure. Based on (1), our model could seem to be close to factor models described in  and . However, in , the high-dimensional aspects are not considered and in  the sparsity constraint is not studied. Note also that the block diagonal assumption has already been recently considered by  for estimating the inverse of large covariance matrices in high-dimensional Gaussian Graphical Models (GGM).
We shall moreover propose a methodology to estimate in the case where the block structure is latent that is when the columns and rows of have to be permuted according to an unknown permutation in order to make the block structure appear. An example of such a matrix is given in Figure 2 in the case where and .
Our approach is fully data-driven and consists in providing a low rank matrix approximation of the part of and then in using a regularization in order to obtain a sparse estimator of . In the case where the block structure is latent, a hierarchical clustering step has to be applied beforehand.
2. Statistical inference
The strategy that we propose for estimating can be summarized as follows.
First step: Low rank approximation. In this step, we propose to approximate the part of by a low rank matrix using a Singular Value Decomposition (SVD).
Second step: Detecting the position of the non null values. In this step, we use a Lasso criterion in order to propose a sparse estimator of .
Third step: Positive definiteness. In order to ensure that the final estimator of is positive definite we use the methodology proposed by .
2.1. Low rank approximation
By definition of in (1), is a low rank matrix having its rank smaller or equal to . In the first step, our goal is thus to propose a low rank approximation of an estimator of .
Let be the sample covariance matrix defined by
where . The corresponding sample correlation matrix is defined by:
Let us also consider the matrix defined by:
We shall use a rank approximation of by considering in its singular value decomposition only the largest singular values and by replacing the other ones by 0. By the Eckart-Young-Mirsky theorem, this corresponds to the best rank approximation of . can thus be seen as a low rank approximation of an estimator of . The choice of will be discussed in Section 2.3.
2.2. Detecting the position of the non null values
Let us first explain the usual framework in which the Lasso approach is used. We consider a linear model of the following form
where , and are vectors and is sparse meaning that it has a lot of null components.
In such models a very popular approach initially proposed by  is the Least Absolute Shrinkage eStimatOr (Lasso), which is defined as follows for a positive :
where, for , and , i.e. the -norm of the vector . Observe that the first term of (5) is the classical least-squares criterion and that can be seen as a penalty term. The interest of such a criterion is the sparsity enforcing property of the -norm ensuring that the number of non-zero components of the estimator of is small for large enough values of . Let
where defined in Section 16.4 of  is such that for a matrix ,
where is the sub-vector of the column of obtained by striking out the first elements. In order to estimate the sparse matrix , we need to propose a sparse estimator of . To do this we apply the Lasso criterion described in (5), where is the identity matrix. In the case where is an orthogonal matrix it has been shown in  that the solution of (5) is:
where denotes the th column of . Using the fact that is the identity matrix we get
We then reestimate the non null coefficients using the least-squares criterion and get:
where is defined in (6).
We thus get by putting to zero the value of that are under a given threshold. Then, we choose the upper triangular part of the estimator of to be equal to . Since the diagonal terms of are assumed to be equal to 1, we take the diagonal terms of equal to 1. The lower triangular part of is then obtained by symmetry.
2.2.1. Positive definiteness
2.3. Choice of the parameters
Our methodology depends on two parameters: The number of singular values kept for defining and the number of non zero values in defined in (7). We propose to first estimate and then estimate the number of non null values. In both cases, we propose to use a kind of “elbow” criterion that is detailed hereafter.
To choose , we propose to fit to the eigenvalues sorted in the non decreasing order two simple linear regressions and to choose those achieving the best fit. We choose for the index which precedes the one where the simple linear regression changes.
To choose the number of non null values in defined in (7), we compute for different values of the Frobenius norm of the difference , that is , where . As before we then fit two simple linear regressions and we choose the value of and thus the number of non values achieving the best fit.
3. Numerical experiments
Our methodology described in the previous section is implemented in the R package BlockCov and is available from the CRAN (Comprehensive R Achive Network) and from GitHub.
We propose hereafter to investigate the performance of our approach for different types of matrices defined in (1) and for different values of and . The four following cases considered correspond to different types of matrices , the matrices being chosen accordingly to ensure that the matrix has its diagonal terms equal to 1.
Diagonal-Equal case. In this situation, has the structure displayed in the left part of Figure 1, namely it has 5 columns such that the numbers of the non values in the five columns are equal to , , , and , respectively and the non null values are equal to , , , and , respectively.
Diagonal-Unequal case. In this scenario, has the same structure as for the Diagonal-Equal case except that the non null values in the five columns are not fixed but randomly chosen in except for the third column for which its values are randomly chosen in .
Extra-Diagonal-Equal case. Here, has the structure displayed in the right part of Figure 1. The values of the columns of are the same as those of the Diagonal-Equal case except for the fourth column which is assumed to contain additional non values equal to -0.5 in the range .
Extra-Diagonal-Unequal case. has the same structure as in the Extra-Diagonal-Equal case except that the values are randomly chosen as in the Diagonal-Unequal case except for the fourth column where the additional non values are still equal to -0.5 in the range .
For and , 100 matrices were generated such that its rows are i.i.d. -dimensional zero-mean Gaussian vectors having a covariance matrix chosen according to the four previous cases: Diagonal-Equal, Diagonal-Unequal, Extra-Diagonal-Equal or Extra-Diagonal-Unequal.
3.1. Low rank approximation
For all of the four scenarios, the approach for choosing described in Section 2.3 is illustrated in Figure 3. We can see from this figure that our methodology finds the right value of which is here equal to 5 in all the cases.
To go further, we investigate the behavior of our methodology from 100 replications of the matrix for the four different types of . Figure 4 displays the corresponding means and the standard deviations of . We can see from this figure that our approach can efficiently estimate the rank of , which is equal to 5 for the different values of and considered.
3.2. Number of non null values
For the four scenarios, the approach for choosing and hence the number of non null values in described in Section 2.3 is illustrated in Figure 5. We observe that the number of non null values seems to be slightly over estimated in the Equal cases and under estimated in the Unequal cases.
Then, we investigate the behavior of our methodology from 100 replications of the matrix for the four different types of . Figure 6 displays the boxplots of the non null estimated values in the four different scenarios and for different values of and . We can see once again from this figure that the number of non null values seems to be slightly over estimated in the Equal cases and under estimated in the Unequal cases.
3.3. Position of the non null values
In this section, we study the performance of our method for retrieving the non null positions of the matrix . The results are displayed in Figure 7. We can see from this figure that the diagonal blocks having their values ranging from to are always detected as non null. We can also observe that even if the extra-diagonal blocks have their entries equal to , they are estimated as being non null with a frequency of over the different replications.
3.4. Comparison with other methodologies
The goal of this section is to compare the statistical performance of our approach with other methodologies.
Since our goal is to retrieve groups within the covariance matrix, we shall use clustering techniques. Once the groups or blocks have be obtained, is estimated by assuming that the corresponding matrix estimator is block-wise constant except for the diagonal blocks for which the diagonal entries are equal to 1 and the extra-diagonal terms are assumed to be equal. This gives a great advantage to these methodologies in the Diagonal-Equal and in the Extra-Diagonal-Equal scenarios. More precisely, let denote the value of the entries in the block having its rows corresponding to Group (or Cluster) and its columns to Group (or Cluster) . Then, for a given clustering :
where denotes the cluster , denotes the number of elements in the cluster and is the entry of the matrix defined in Equation (2).
For the matrices corresponding to the four scenarios previously described, we shall compare the statistical performance of the following methods:
empirical which estimates by defined in (2),
blocks which estimates using the methodology described in this article,
blocks_real which estimates using the methodology described in this article when and the number of non null values are assumed to be known,
In order to improve the performance of the clustering approaches: hclust, Specc and kmeans, the real number of clusters has been provided to these methods. The performance of the different approaches are assessed using the Frobenius norm of the difference between and its estimator.
Figure 8 displays the mean and the standard deviations of the Frobenius norm of the difference between and its estimator for different values of and in the four different cases: Diagonal-Equal, Diagonal-Unequal, Extra-Diagonal-Equal and Extra-Diagonal-Unequal. We can see from this figure that in all the scenarios the blocks and blocks_real approaches outperform the others. Moreover, since the performance of blocks and blocks_real are very similar, it means that the performance of our approach are not altered by our way of estimating and .
3.5. Columns permutation
In practice, it may occur that the columns of consisting of the rows are not ordered in a way which makes blocks appear in the matrix . To address this issue, we propose to perform a hierarchical clustering on beforehand and use the obtained permutation of the observations which guarantees that a cluster plot using this ordering will not have crossings of the branches. Let us denote the matrix in which the columns have been permuted according to this ordering and the covariance matrix of each row of . Then, we apply our methodology to which should provide an efficient estimator of . In order to get an estimator of the columns and rows are permuted according to the ordering coming from the hierarchical clustering.
To assess the corresponding loss of performance, we generated for each matrix used for making Figure 8 a matrix in which the columns of were randomly permuted. The associated covariance matrix is denoted . Then, we applied the methodology described in the previous paragraph denoted blocks_samp in Figure 9 thus providing . The performance of this new methodology was compared to the methodology that we proposed in the previous sections (denoted blocks in Figure 9) when the columns of were not permuted. The results are displayed in Figure 9. We can see from this figure that the performance of our approach does not seem to be altered by the permutation of the columns.
3.6. Numerical performance
Figure 10 displays the computational times of our methodology for different values of ranging from 100 to 3000 and . The timings were obtained on a workstation with 16 GB of RAM and Intel Core i7 (3.66GHz) CPU. Our methodology is implemented in the R package BlockCov which uses the R language (R Core Team, 2017) and relies on the R package Matrix. We can see from this figure that it takes around 3 minutes to estimate a correlation matrix.
3.7. Estimator of
Even if providing an estimator of a large covariance matrix can be very useful in practice, it may also be interesting to efficiently estimate . Such an estimator can indeed be used in the general linear model in order to remove the dependence that may exist between the columns of the observations matrix. For further details on this point, we refer the reader to  in which such an approach is proposed for performing variable selection in the multivariate linear model in the presence of dependence between the columns of the observation matrix.
Since is a symmetric positive definite matrix, it can be rewritten as , where is a diagonal matrix and is an orthogonal matrix. The matrix can thus be estimated by where is a diagonal matrix having its diagonal terms equal to the square root of the inverse of the singular values of . However, inverting the square root of too small eigenvalues may lead to poor estimators of . This is the reason why we propose to estimate by
where is a diagonal matrix such that its diagonal entries are equal to the square root of the inverse of the diagonal entries of except for those which are smaller than a given threshold which are replaced by 0 in .
Since we are interested in assessing the ability of to remove the dependence that may exist between the columns of , we shall consider the Frobenius norm of which should be close to zero, where denotes the identity matrix of . Figure 11 displays the Frobenius norm of for different threshold . A threshold of 0.1 seems to provide a small error in terms of Frobenius norm. Hence, in the following, will be equal to 0.1 and will be referred as .
This technique was applied to all of the estimators of discussed in Section 3.4 to get different estimators of . The Frobenius norm of the error is used to compare the different estimators obtained by considering the different estimators of . The results are displayed in Figure 12. We observe from this figure that in the Diagonal-Equal case the estimators of derived from the hclust estimator of perform better than the other estimators. In all the other scenarios the estimators of derived from the blocks and blocks_real estimators of seem to be more adapted than the others to remove the dependence among the columns of .
Then, the estimators of derived from blocks and blocks_real were compared to the GRAB estimator proposed by . Since the computational burden of GRAB is high for large values of , we limit ourselves to the Extra-Diagonal-Equal case when and for the comparison. Figure 13 displays the results. Once again, blocks and blocks_real provide better results than GRAB. However, it has to be noticed that the latter approach depends on a lot of parameters that were difficult to choose, thus we used the default ones.
4. Application to liver toxicity data
In this section, we apply our R package BlockCov to the liver toxicity data set available from the mixomics package. This data set contains the expression measure of 3116 genes and 10 clinical measurements for 64 subjects (rats) that were exposed to non-toxic, moderately toxic or severely toxic doses of acetaminophen in a controlled experiment. We fit to the genes expression measures a multivariate linear model as if the columns of the observation matrix were independent and thus derive a residual matrix which is assumed to have the same properties as the matrix . Our goal will thus be to estimate the correlation matrix of this residual matrix by using the methodology developped in the paper.
4.1. Low rank approximation
Let be the matrix defined from the sample correlation matrix of the residual matrix by (3) once its columns and rows have been permuted using the ordering provided by the hierarchical clustering. Its eigenvalues sorted in the decreasing order are displayed in Figure 14. From this figure, we choose .
4.2. Number of non null values
The number of non null values in was then selected using the “elbow” criterion which is displayed in Figure 15. From the 4853170 values of the upper-triangular part of the matrix, 1106116 were detecting as non null values, which corresponds to of non values.
4.3. Estimation of the correlation matrix
Figure 16 displays the estimation of the correlation matrix of the residuals of the genes expression measures once the rows and the columns of the residual matrix have been permuted according to the ordering provided by the hierarchical clustering.
We can see from Figure 16 that there is a cluster of highly correlated genes. To see if this cluster has a biological meaning we propose in the next section a gene ontology analysis to go further into the analysis.
4.4. Gene ontology analysis
In this section, we perform an enrichment study in biological function described in  by using Bioprofiling.de for the list of genes which appear to be highly correlated in Figure 16. It consists in comparing the odd-ratio of a given gene list for the different biological functions with the odd-ratio of the whole genome of rats for the same biological functions, using a Chi-square test and a correction for multiple testing. The results are displayed in Table 1.
|CYTOPLASM||100 (362)||3305 (18868)||1,58||0,00095452434898226|
|MITOCHONDRION||49 (362)||1282 (18868)||1,99||0,00290007110939337|
|PROTEIN BINDING||158 (362)||6688 (18868)||1,23||0,01|
|ENDOPLASMIC RETICULUM MEMBRANE||17 (362)||384 (18868)||2,31||0,01|
|LIPID METABOLIC PROCESS||10 (362)||144 (18868)||3,62||0,01|
|LIGASE ACTIVITY||13 (362)||158 (18868)||4,29||0,01|
|ACID-AMINO ACID LIGASE ACTIVITY||7 (362)||44 (18868)||8,29||0,01|
|NUCLEUS||87 (362)||3463 (18868)||1,31||0,01|
|MRNA PROCESSING||9 (362)||116 (18868)||4,04||0,01|
|PROTEIN K48-LINKED UBIQUITINATION||5 (362)||22 (18868)||11,85||0,01|
|NUCLEOTIDE BINDING||47 (362)||1488 (18868)||1,65||0,01|
|METAL ION BINDING||47 (362)||1526 (18868)||1,61||0,01|
|GOLGI APPARATUS||27 (362)||608 (18868)||2,31||0,01|
|FATTY ACID METABOLIC PROCESS||8 (362)||64 (18868)||6,52||0,01|
|ENDOPLASMIC RETICULUM||26 (362)||688 (18868)||1,97||0,01|
|CELLULAR_COMPONENT||47 (362)||1572 (18868)||1,56||0,01|
|MITOCHONDRIAL MATRIX||11 (362)||125 (18868)||4,59||0,01|
|SMALL GTPASE MEDIATED SIGNAL TRANSDUCTION||11 (362)||192 (18868)||2,99||0,02|
|CYTOSOL||39 (362)||1263 (18868)||1,61||0,02|
|MOLECULAR_FUNCTION||46 (362)||1651 (18868)||1,45||0,04|
We can see from this table that the “LIPID METABOLIC PROCESS” and the “FATTY ACID METABOLIC PROCESS” appear to have an odd-ratio significantly different for this list of genes. Besides, these biological functions have already been identified as being specific of the liver, see for instance  and .
Since, by , hepatic injury and subsequent hepatic failure due to overdose of acetaminophen has affected patients for decades, this could be an indication that our approach was able to efficiently identify interesting correlated clusters of genes.
In this paper, we propose a fully data-driven methodology for estimating large block structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples without limiting ourselves to block diagonal matrices. Our methodology can also deal with matrices for which the block structure only appears if the columns and rows are permuted according to an unknown permutation. Our technique is implemented in the R package BlockCov which is available from the Comprehensive R Archive Network and from GitHub. In the course of this study, we have shown that BlockCov is a very efficient approach both from the statistical and numerical point of view. Moreover, its very low computational load makes its use possible even for very large covariance matrices having several thousands of rows and columns.
-  A. V. Antonov. BioProfiling.de: analytical web portal for high-throughput cell biology. Nucleic Acids Res., 39(Web Server issue):W323–327, Jul 2011.
-  O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485–516, 2008.
-  D. Bates and M. Maechler. Matrix: Sparse and Dense Matrix Classes and Methods, 2018. R package version 1.2-13.
-  P. J. Bickel and E. Levina. Covariance regularization by thresholding. Ann. Statist., 36(6):2577–2604, 12 2008.
-  T. T. Cai and M. Yuan. Adaptive covariance matrix estimation through block thresholding. Ann. Statist., 40(4):2014–2042, 08 2012.
-  A. Canbay, L. Bechmann, and G. Gerken. Lipid metabolism in the liver. Z Gastroenterol, 45(1):35–41, 2007. doi: 10.1055/s-2006-927368.
-  E. Devijver and M. Gallopin. Block-diagonal covariance selection for high-dimensional gaussian graphical models. Journal of the American Statistical Association, 113(521):306–314, 2018.
-  J. Fan, F. Han, and H. Liu. Challenges of big data analysis. National Science Review, 1(2):293–314, 2014.
-  J. Fan, L. Yuan, and L. Han. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
-  K. N. Frayn, P. Arner, and H. Yki-Järvinen. Fatty acid metabolism in adipose tissue, muscle and liver in health and disease. Essays In Biochemistry, 42:89–103, 2006.
-  C. Giraud. Introduction to High-Dimensional Statistics. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 2014.
-  D. Harville. Matrix Algebra: Exercises and Solutions: Exercises and Solutions. Springer New York, 2001.
-  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
-  N. J. Higham. Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.
-  M. J. Hosseini and S.-I. Lee. Learning sparse gaussian graphical models with overlapping blocks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3808–3816. Curran Associates, Inc., 2016.
-  R. A. Johnson and D. W. Wichern, editors. Applied Multivariate Statistical Analysis. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
-  O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365 – 411, 2004.
-  M. Perrot-Dockès, C. Lévy-Leduc, L. Sansonnet, and J. Chiquet. Variable selection in multivariate linear models with high-dimensional covariance matrix estimation. Journal of Multivariate Analysis, 166:78 – 97, 2018.
-  R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Royal. Statist. Soc B., 58(1):267–288, 1996.
-  U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, Dec 2007.
-  E. Yoon, A. Babar, M. Choudhary, M. Kutner, and N. Pyrsopoulos. Acetaminophen-induced hepatotoxicity: a comprehensive update. Journal of Clinical and Translational Hepatology, 4(2):131–142, 2016. doi:10.14218/JCTH.2015.00052.