Block-diagonal covariance selection for high-dimensional Gaussian graphical models
Gaussian graphical models are widely utilized to infer and visualize networks of dependencies between continuous variables. However, inferring the graph is difficult when the sample size is small compared to the number of variables. To reduce the number of parameters to estimate in the model, we propose a non-asymptotic model selection procedure supported by strong theoretical guarantees based on an oracle type inequality and a minimax lower bound. The covariance matrix of the model is approximated by a block-diagonal matrix. The structure of this matrix is detected by thresholding the sample covariance matrix, where the threshold is selected using the slope heuristic. Based on the block-diagonal structure of the covariance matrix, the estimation problem is divided into several independent problems: subsequently, the network of dependencies between variables is inferred using the graphical lasso algorithm in each block. The performance of the procedure is illustrated on simulated data. An application to a real gene expression dataset with a limited sample size is also presented: the dimension reduction allows attention to be objectively focused on interactions among smaller subsets of genes, leading to a more parsimonious and interpretable modular network.
Key words and phrases:Network inference, graphical lasso, variable selection, non-asymptotic model selection, slope heuristic
- 1 Introduction
- 2 A method to detect block-diagonal covariance structure
- 3 Theoretical results for non-asymptotic model selection
- 4 Simulation study
- 5 Real data analysis
- 6 Discussion
7 Appendix: Theoretical proofs
- 7.1 Model collection and discretization
- 7.2 Oracle inequality: proof of Theorem 3.1
- 7.3 Lower bound for the minimax risk: proof of Theorem 3.2
Graphical models [Whi90] have become a popular tool for representing conditional dependencies among variables using a graph. For Gaussian graphical models (GGMs), the edges of the corresponding graph are the non-zero coefficients of the inverse covariance matrix. To estimate this matrix in high-dimensional contexts, methods based on an penalized log-likelihood have been proposed [MB06, YL07, BEGd08]. A popular method is the graphical lasso algorithm introduced by [FHT08]. Gaussian graphical models have many potential applications, such as the reconstruction of regulatory networks from real omics data [KSI11, ANW14]. However, these methods often perform poorly in so-called ultra high-dimensional contexts [Gir08, Ver12], when the number of observations is much smaller than the number of variables. A small sample size is a common situation in various applications, such as in systems biology where the cost of novel sequencing technologies may limit the number of available observations [FLL11]. In practice, the network reconstruction problem is facilitated by restricting the analysis to a subset of variables, based on external knowledge and prior studies of the data [ACM09, YL11]. When no external knowledge is available, only the most variable features are typically kept in the analysis [GLMZ11, AL13]. Choosing the appropriate subset of variables to focus on is a key step in reducing the model dimension and the number of parameters to estimate, but no procedure is clearly established to perform this selection when the sample size is really low.
In the context of graphical lasso estimation, [WFS11] and [MH12] have noticed a particular property: the block-diagonal structure of the graphical lasso solution is totally determined by the block-diagonal structure of the thresholded empirical covariance matrix. The graphical lasso estimation for a given level of regularization can be decomposed into two steps. First, the connected components of the graphical model are detected, based on the absolute value of the sample covariance matrix thresholded at level . Second, the graphical lasso problem is solved in each connected component independently using the same regularization parameter for each subproblem. This decomposition is of great interest to reduce the number of parameters to estimate for a fixed level of regularization. It has been exploited for large-scale problems [ZLR12] and for joint graphical lasso estimations [DWW14]. [HSDR14] have improved the computational cost of the two-step procedure by using a quadratic approximation, and have proved the superlinear convergence of their algorithm. [TWS15] have noticed that the first step, i.e. the detection of connected components by thresholding, is equivalent to performing a single linkage clustering on the variables. They have proposed the cluster graphical lasso, using an alternative to single linkage clustering in the two-step procedure, such as average linkage clustering. The selection of the cutoff applied to hierarchical clustering in the first step of the cluster graphical lasso algorithm is performed independently from the selection of the regularization parameters in the second step of the algorithm. Their results suggest that the detection of the block diagonal structure of the covariance matrix prior to network inference in each cluster can improve network inference. Other authors have recently proposed procedures to detect the block-diagonal structure of a covariance matrix. [PBT12] provided a method to detect this structure for high-dimensional supervised classification that is supported by asymptotic guarantees. [HSNP15] proposed tests to perform this detection and derived consistency for their method when the number of variables and the sample size tend to infinity.
In this paper, we adapt the two-step procedure proposed by [WFS11] and [MH12] to infer networks of conditional dependencies between variables. To improve network inference performance, we decouple the two steps of the procedure, using different parameters for thresholding the empirical covariance matrix and for estimation within each connected component of the network, as proposed by [TWS15]. The main contribution of our work stands in the use of a non-asymptotic criterion to perform the first step of the two-step procedure, i.e. the detection of the block-diagonal structure of a covariance matrix. In our procedure, we recast the detection problem into a model selection problem and aim at choosing the best model among a collection of multivariate distributions with block-diagonal covariance matrices. We obtained non-asymptotic theoretical guaranties, based on the control of the risk of the model selection procedure. These results provide a penalty to select a model. This penalty is known up to multiplicative constants depending on assumptions hard to explicitly satisfy in practice. To calibrate these constants in practice, we use the slope heuristic, originally proposed by [BM01] and detailed in [BMM12]. Unlike other methods to detect the appropriate block-diagonal covariance matrix [PBT12, TWS15, HSNP15], our procedure is non-asymptotic and offers strong theoretical guarantees when the number of observations is limited, which is of great interest for many real applications. More precisely, we prove that our estimator is adaptive minimax to the structure of the covariance matrix.
The paper is organized as follows. In Section 2, after providing basic notations and definitions, the non-asymptotic method to detect the block-diagonal structure of the GGM is presented, as well as the complete framework to infer network where the number of observations is limited. Section 3 details theoretical results supporting our model selection criterion. In particular, an oracle type inequality upper bounds the risk between the true model and the model selected among the collection of models, and a minimax lower bound guarantees that the non-asymptotic procedure has an optimal rate of convergence. Section 4 investigates the numerical performance of our method in a simulation study. Section 5 illustrates our procedure on a real gene expression RNA-seq dataset with a limited sample size. All proofs are provided in Appendix 1.
2. A method to detect block-diagonal covariance structure
Let be a sample in from a multivariate normal distribution with density where for all . Let be the empirical covariance matrix associated with the sample . Our goal is to infer the graph of conditional dependencies between variables, encoded by the precision matrix . Since the matrices and have the same block-diagonal structure, we first seek to detect the optimal block-diagonal structure of the covariance matrix , i.e. the optimal partition of variables into blocks. We index the variables from to . We note the partition of variables into blocks where is the subset of variables in block , and is the number of variables in block . The partition describes the block-diagonal structure of the matrix: off the block, all coefficients of the matrix are zeros. Note that we authorize the permutation inside blocks: e.g. the partition of variables into blocks is equivalent to the partition . We also authorize reordering of the blocks: e.g. is equivalent to . We consider the following set of multivariate normal densities with block-diagonal covariance matrices:
where is the set of positive semidefinite matrices of size , and are real numbers, are the smallest and highest eigenvalues of and is a permutation matrix leading to a block-diagonal covariance matrix.
We consider the set of all possible partitions of variables. In theory, we would like to consider the collection of models
However, the set is large: its size is the Bell number. An exhaustive exploration of the set is then not possible even for a moderate number of variables . We restrict our attention to the sub-collection:
of where is the partition of variables corresponding to the block-diagonal structure of the adjacency matrix , based on the thresholded absolute value of the sample covariance matrix . Recall that [MH12] have proved that the class of block-diagonal structures detected by thresholding of the sample covariance is the same class of block-diagonal structures detected by the graphical lasso algorithm when the regularization parameter varies, which supports the fact that we restrict our attention to this specific sub-collection. Note that the data is scaled if needed so that the set of thresholds covers all possible partitions derived from .
Once we have constructed the collection of models , we select a model among this collection using the following model selection criterion:
where is a penalty term to define and where is the maximum likelihood estimator of . The matrix is constructed block by block, using the sample covariance matrix of the dataset restricted to variables in each block.
The penalty term is based on non-asymptotic model selection properties, as detailed in Section 3: where is the dimension of the model and are two constants depending on absolute constants and on the bounds and . In practice, we consider a simpler version of the penalty term:
where is a constant depending on absolute constants and on the bounds and . Such simplification has already been proposed by [Leb05]. The extra term is useful to overpenalize the collection of models when it contains many models with the same sizes. The simplification of the penalty term (4) is reasonable for moderate number of variables.
Subsequently, we note that the bounds and are non-tractable. For this reason, we prefer to calibrate the constant in (4) from the data. This calibration is based on the slope heuristic, originally proposed and proved in the context of heteroscedastic regression with fixed design [BM07, BGH09], and for homoscedastic regression with fixed design [AM09]. In other contexts, the slope heuristic has been used and have proven to be effective for multiple change point detection [Leb05], for variable selection in mixture models [MM11], for choosing the number of components in Poisson mixture models [RMRMMC15] or for selecting the number of components in discriminative functional mixture models [BCJ15].
[BMM12] have provided practical tools to calibrate the coefficient in (4) based on the slope heuristic developed by [BM07]. We describe these tools in Section 4. Note that the detection of the optimal is easy to implement in practice and does not rely on heavy computation such as cross-validation techniques.
Once we have detected the optimal block-diagonal structure of the GGM, network inference is performed independently in each block using the graphical lasso [FHT08].
We summarize the method proposed to infer network in high-dimensional context.
(Block-diagonal covariance structure detection) Select the modularity structure of the network.
Compute the sample covariance matrix .
Construct the sub-collection of partitions , where is a set of thresholds, more precisely the set of values taken by the sample covariance matrix in absolute value. Each partition corresponds to the block-diagonal structure of the matrix .
For each partition , compute the corresponding maximum log-likelihood of the model.
Based on the log-likelihood associated to each partition in , calibrate the penalty in equation (4) to select the partition using the slope heuristic.
(Network inference in each module) For each group of variables in the selected partition , infer the network using the graphical lasso introduced by [FHT08]. The choice of the regularization parameter for the graphical lasso algorithm is performed independently in each module.
3. Theoretical results for non-asymptotic model selection
The model selection procedure presented in Section 2 is justified by theoretical results. We obtain bounds on the risk between the selected density and the true one, which prove that we select a good block-diagonal structure. More precisely, we aim at selecting, among , the optimal partition . First, for each model indexed by , we consider the density where is the maximum likelihood estimator of . Among all , we want to select the density which is the closest one to the true distribution . To measure the distance between the two densities, we define the risk:
where is a distance between two densities. Ideally, we would like to select the partition that minimizes the risk : this partition is called the oracle. Unfortunately, it is not reachable in practice because the true density is unknown. However, we will prove that we do almost as well as the oracle, i.e. we select a model for which the risk of the procedure is upper bounded by the oracle risk, up to a constant.
Before stating the theorem, we recall the definition of the Hellinger distance between two densities and defined on , and the Kullback-Leibler divergence between two densities and defined on ,
Let be the observations, arising from a density .
Consider the collection of models defined in (2).
We denote by the maximum likelihood estimator for the model . Let as defined in (3).
Let , and for all , let such that:
Then, there exists some absolute constants and such that whenever
for every , with , the random variable such that
exists and, moreover, whatever the true density ,
The proof is presented in Supplementary Material A. This theorem is deduced from an adaptation of a general model selection theorem for maximum likelihood estimator developed by [Mas07]. This adaptation allows to focus on a random sub-collection of the whole collection of models and is also proved in Supplementary Material A. To use this theorem, the main assumptions to satisfy are the control of the bracketing entropy of each model in the whole collection of models and the construction of weights for each model to control the complexity of the collection of models. To compute the weights, we use combinatorics arguments. The control of the bracketing entropy is a classical tool to bound the Hellinger risk of the maximum likelihood estimator, and has already been done for Gaussian densities in [GW00] and [MM11].
The assumption on the true density (5) is done because we consider a random subcollection of models from the whole collection of models . Thanks to this assumption, we use the Bernstein inequality to control the additional randomness. The parameter depends on the true unknown density and cannot be explicitly determined for this reason. We could do some hypothesis on the true density to be able to explicit but we choose not to do it: e.g. under the assumption that the Kullback-Leibler divergence and the Hellinger distance are equivalent, we can explicitly determine . Note that the parameter only appears in the rest term and not on the penalty term . Therefore, we do not need to explicit to select a model. Technical details are discussed in Section B.4. of the Supplementary Material A.
We remark that the Hellinger risk is upper bounded by the Kullback-Leibler divergence in (7). For this reason, the result (7) is not exactly an oracle inequality and is called an oracle type inequality. However, the use of the Kullback-Leibler divergence and the Hellinger distance is common for model selection theorem for Maximum Likelihood Estimator: e.g. Theorem 7.11 in [Mas07]. Moreover, the Kullback-Leibler divergence is comparable to the Hellinger distance under some assumptions. Under these assumptions, the result (7) is exactly an oracle inequality.
The collection of models (2) is defined such that covariance matrices have bounded eigenvalues. These bounds are useful to control the complexity of each model by constructing a discretization of this space. Every constant involved in (7) depends on these bounds. This assumption is common in non-asymptotic model selection framework. However, the bounds are not tractable in practice. They are calibrated in practice based on the data using the slope heuristic, as discussed in Section 4.
To complete this analysis, we provide a second theoretical guarantee. In contrast with [Leb05, MM11], we strengthen the oracle type inequality using a minimax lower bound for the risk between the true model and the model selected. Note that in Gaussian Graphical Models, lower bounds have already been obtained in other contexts [BL08, CZZ10].
In Theorem 3.1, we have proved that we select a model as good as the oracle model in a density estimation framework. However, the bound has two extra terms: the penalty term and the rest . The two terms give the rate of the estimator. Based on Theorem 3.1 only, we do not know if the rate is as good as possible. The following theorem lower bounds the risk by a rate with the same form as the upper bound (seen as a function of , and ), which guarantees that we obtain an optimal rate.
Let . Consider the model defined in (1), and its dimension. Then, if we denote , for any estimator of one has
This theorem is proved in Supplementary Material A Section C. To obtain this lower bound, we use Birgé’s lemma [Bir05] in conjunction with a discretization of each model, already constructed to obtain the oracle type inequality.
Thanks to Theorems 3.1 and 3.2, we upper bound and lower bound the Hellinger risk. The two bounds can be compared if we neglect the Kullback-Leibler term (the bias term) and the rest . These terms are small if the collection of models is well constructed, i.e if the true density of the data is not too far from the constructed collection of models. E.g., if the true model belongs to the collection of models, the two terms equal zero. Thanks to Theorem 3.2, we say that the estimator satisfying (6) is minimax to . Moreover, the lower bound is obtained for a fixed , and the rate we obtain is minimax. We deduce that the performance of the procedure is as good as if we knew the structure. Consequently, our procedure is adaptive minimax, which is a strong theoretical result.
Note that the model selection procedure is optimized for density estimation and not for edge selection: the Hellinger distance and Kullback-Leibler divergence measure the differences between two densities from an estimation point of view. In contrast, network inference focuses on edge selection. However, we point out that the model selection procedure is only proposed in a specific context ( small), as a preliminary step (step A detailed in Section 2) prior to edge selection (step B). Although this preliminary step (step A) is not optimized for edge selection, it improves the network inference procedure as illustrated in simulated data (Section 4).
To conclude, let recall that these results are non-asymptotic, which means that they hold for a fixed sample size , which is particularly relevant in a context where the number of observations is limited. The results are consistent with the point of view adopted in this work.
4. Simulation study
In this section, we compare the performance of the proposed method described in Section 2 with the Cluster Graphical Lasso [TWS15] and the Graphical Lasso [FHT08]. We first compare the performance of the block-diagonal covariance structure detection, i.e. step A of the proposed method (Subsection 4.1), and then, compare the performance of the complete network inference methods (Subsection 4.2).
We simulate observations from a multivariate normal distribution with a null mean and a block-diagonal covariance matrix as defined in Section 2. We fix the number of variables , the sample size and the partition on variable with blocks of approximately equal sizes. For each block indexed by , we design the matrix as done in [GHV12]: where is a random lower triangular matrix with values drawn from a uniform distribution between -1 and 1, and is a diagonal matrix designed to prevent from having eigenvalues that are too small.
The Graphical Lasso is implemented in the R package glasso, version 1.7 [FHT08]. The Cluster Graphical Lasso is based on a hierarchical clustering implemented in the R package stats. To detect the partition of variables (step A(2) in Section 2), we find the connected components of the graph associated with the adjacency matrix using a simple breadth-first search implemented in the R package igraph [CN06]. The computation of the log-likelihood of each model (step A(3) in Section 2) is based on the R package mvtnorm implementing a Cholesky decomposition [GB09]. To calibrate the penalty in its simplified version (4) (step A(4) in Section 2), we use the R package capushe implementing the slope heuristic [BMM12]. The complete procedure has been implemented in an R package shock available as Supplementary Material: the package shock.
The practical aspects of the slope heuristic are detailed in [BMM12]: there are two methods to calibrate the penalty coefficient in (4). One calibration method is the Slope Heuristic Dimension Jump (SHDJ): the optimal coefficient is approximated by twice the minimal coefficient , where corresponds to the largest dimension jump on the graph representing the model dimension as a function of the coefficient . Another method is the Slope Heuristic Robust Regression (SHRR): the coefficient is approximated by twice , where corresponds to the slope of a robust regression performed between the log-likehood and the model dimension for complex models. The two methods are derived from the same heuristic and they offer two different visual checks of the adequacy of the model selection procedure to the data. They should select the same model. Note the calibration of the more complex version of the penalty is proposed and tested in Supplementary Material B. The source code of the R package and the code to reproduce the simulation experiments are provided in Supplementary Materials 1, 2 and 3.
4.1. Block-diagonal covariance structure detection
First, we investigate the ability to recover the simulated partition of variables based on the step A of the procedure described in Section 2. Illustrations of the calibration of the penalty coefficient are presented in Figure 1 for one simulated dataset. The code to reproduce the simulation experiment is provided in Supplementary Material 1: Figures 1. The largest dimension jump is easily detected on the graph representing the dimension of the model as a function of the coefficient (Figure 1 left). Likewise, we observe a linear tendency between the log-likehood and the model dimension for complex models (Figure 1 right) and easily fit a linear regression. Both calibration methods yield the same results.
In addition, we compare the partition selection methods with an average linkage hierarchical clustering with as proposed in the cluster graphical lasso [TWS15]. Figure 2 displays the Adjusted Rand Index (ARI) computed over 100 replicated datasets. The ARI measures the similarity between the inferred clustering and the simulated clustering [HA85]. The ARI equals 1 if the two partitions match. The code to reproduce the simulation experiment is provided in Supplementary Material 2: Figures 2 and 3. Despite the fact that the partition with the hierarchical clustering takes as an input parameter the true number of clusters (), the ARI for the hierarchical clustering is lower than the ARI for the two slope heuristic based methods (SHRR and SHDJ) which do not need to specify the number of clusters in advance.
4.2. Downstream network inference performance
To illustrate the potential advantages of prior block-diagonal covariance structure detection, we compare the following strategies for network inference over 100 replicated datasets:
- 1. Glasso::
We perform network inference using the graphical lasso on all variables, with regularization parameter chosen using the following criterion:
where the solution of the graphical lasso with regularization parameter , is the sample covariance matrix, and df the degree of freedom.
- 2. CGL::
We perform network inference using the cluster graphical lasso proposed in [TWS15]. First, the partition of variables is detected using an average linkage hierarchical clustering with clusters. Note that we set the number of clusters to the true number . Subsequently, the regularization parameters in each graphical lasso problem are chosen from Corollary 3 of [TWS15]: the inferred network in each block must be as sparse as possible while still remaining a single connected component.
- 3. Inference on partitions based on model selection::
First, we detect the partition using the two variants of our non-asymptotic model selection (SHRR ou SHDJ).
- (a) SHRR::
The partition is selected using the Slope Heuristic Robust Regression.
- (b) SHDJ::
The partition is detected using the Slope Heuristic Dimension Jump.
Subsequently, the regularization parameters in each graphical lasso problem are chosen using the criterion:
where is the solution of the graphical lasso problem restricted to the variables in block , is the sample covariance matrix on variables belonging to the block and df the corresponding degree of freedom.
- 4. Inference on the true partition of variables (truePart)::
First, we set the partition of variables to the true partition . Then, the regularization parameters in each graphical lasso problem are chosen using the criterion (10).
We compare the performance of the five methods using the sensitivity (), the specificity () and the False Discovery Rate (FDR) () where are respectively the number of true negative, true positive, false negative, false positive dependencies detected. A network inference procedure is a compromise between sensitivity and specificity: we are looking for a high sensitivity, which measures the proportion of dependencies (presence of edges) that are correctly identified, and a high specificity, which measures the proportion of independencies (absence of edges) that are correctly identified. The False Discovery Rate is the proportion of dependencies wrongly detected. The value of sensitivity, specificity and False Discovery Rate are computed over 100 replicated datasets as illustrated in Figure 3. The code to reproduce the simulation experiment is provided in Supplementary Material 2: Figures 2 and 3. As expected, the true partition strategy (truePart) performs the best: based on the true partition of variables, the network inference problem is easier because we solve problems of smaller dimension. The proposed strategies, based on the SHRR and SHDJ partitions, improve network inference compared to a simple graphical lasso on the set of all variables (glasso) or compared to the cluster graphical lasso (CGL).
5. Real data analysis
Pickrell et al. analyzed transcriptome expression variation from 69 lymphoblastoid cell lines derived from unrelated Nigerian individuals [Pic10]. The expression of 52580 genes across 69 observations was measured using RNA-seq. The data is extracted from the Recount database [FLL11]. After filtering weakly expressed genes using the HTSFilter package [RGCJ13], we identified the most variable genes among the 9191 remaining genes, and restrict our attention to this set of genes for the following network inference analysis. The code to reproduce the analysis is provided in Supplementary Material 3.
First, we select the partition using model selection as described in equation (4). The log-likelihood increases with the number of parameters to be estimated in the model as displayed in Figure 4. We notice a linear tendency in the relationship between the log-likelihood and the model dimension for complex models (points corresponding to a model dimension higher than 500). This suggests that the use of the slope heuristic is appropriate for selecting a partition . The model selected by SHDJ and by SHRR described in Section 2 are the same. The number of blocks detected is and the corresponding model dimension is . The partition yields 4 blocks of size and , 4 blocks of size 3, 2 blocks of size 2 and 140 blocks of size 1. The partition selected by the slope heuristic offers a drastic reduction of the number of parameters to infer, as compared with the graphical lasso performed on the full set of variables, which corresponds to a total of parameters to estimate.
The networks within each cluster of variables are inferred using the graphical lasso algorithm of Friedman [FHT08] implemented in the glasso package, version 1.7. The regularization parameter for the graphical lasso on the set of all variables is chosen using the criterion (9). The model inferred based on partition is more parsimonious and easier to interpret than the model inferred on the full set of variables. An illustration of inferred networks in the four largest connected components of the partition are displayed on Figure 5. These four networks might be good candidates for further study.
In this paper, we propose a non-asymptotic procedure to detect a block diagonal structure for covariance matrices in GGMs. Our non-asymptotic approach is supported by theoretical results: an oracle type inequality ensures that the model selected based on a penalized criterion is close to the oracle, i.e. the best model among our family. Moreover, we obtain a minimax lower bound of the risk between the true model and the model selected among the collection of models, which ensures that the model selection procedure is optimal, i.e. adaptive minimax to the block-structure.
The method we propose is easy to implement in practice and fast to compute. The calibration of the coefficient by robust regression and dimension jump encounters no particular difficulty. Moreover, graphical representations of the two slope heuristic calibration methods allow to visualize the calibration performance.
Note that our non-asymptotic results hold for a fixed number of sample, which is a typical case in many real applications. Although GGMs are widely used in practice, limited sample sizes typically force the user to restrict the number of variables. Usually, this restriction is performed manually based on prior knowledge on the role of variables. Here, our procedure allows to select relevant subsets of variables based on a data-driven criterion. This procedure is of great practical interest to estimate parameters in GGMs when the sample size is small. We apply it on a high-throughput genomic dataset but the procedure may be useful for other types of data with low sample size (e.g., neuroscience, sociology).
7. Appendix: Theoretical proofs
In this Supplementary Material A, we give details about the proofs of the theoretical results presented in the paper.
First, we recall the collection of models as defined in the article. In Section 7.1, we describe a discretization of the collection of models, which is useful to prove the oracle type inequality and the lower bound. In Section 7.2, we prove the oracle type inequality. In Section 7.3, we prove the minimax lower bound using Birgé’s lemma.
We recall the main notations as defined in the article. Let be a sample of of size . We consider the following collection of models, for the partition of variables into blocks,
where is the set of positive semidefinite matrices, and are real numbers, are the smallest and highest eigenvalues of and is a permutation matrix. We also note and the smallest and the largest values of (remark that and could be bounded by , but we use the notations and to simplify the reading).
7.1. Model collection and discretization
The aim of this section is to discretized the set . Our theoretical results rely on combinatorial arguments, such as an assumption on the bracketing entropy for the oracle type inequality and the Birgé’s lemma for the lower bound. Then, we develop in this section in details the discretization used later on.
We distinguish between low- and high-dimensional models. For low-dimensional models, we construct a thiner discretization, whereas in the high-dimensional models, the densities are closer and the situation is more complicate.
7.1.1. Discretization of the set of adjacency matrices for low-dimensional models
We denote by the adjacency matrix associated to a covariance matrix . Let . For a given matrix , we may identify a corresponding adjacency matrix . This matrix of size could be summarized by the vector of concatenated upper triangular vectors. Then, we construct a discrete space for which is in bijection with
First, we focus on the set .
Let be equipped with Hamming distance . Let be the subset of of vectors for which the corresponding graph has structure .
For every , let such that . There exists some subset of with the following properties:
where and .
Proof. Let be a maximal subset of satisfying property (13). Then the closed balls with radius whose belongs to cover . We remark that is a group action, isometric and transitive on .
for every , where .
Our proof is similar to the proof of Lemma 4.10 in [Mas07]. We consider:
Let . Let such that . Due to [Mas07], we prove that
with . It relies on exponential hypergeometric tail bounds.
Nevertheless, as , for ,
7.1.2. Discretization of the set of covariance matrices for low-dimensional models
In this section, we use Lemma 7.1 to deduce a discretization of the set of covariance matrices for low-dimensional models.
Let and such that . Let as constructed in Lemma 7.1, and its equivalent for adjacency matrices. Let . Let:
Proof. Let with . If and are close, either they have the same adjacency matrix and they differ only on a coefficient or they differ in their adjacency matrices. In the first case, . In the second case, . Then,
this minimum depending on and .
If , let as constructed in Lemma 7.1, and its equivalent for adjacency matrices. Let , and
7.1.3. Discretization of the set of covariance matrices for high-dimensional models
In this section, we deal with high-dimensional models. The following lemma and its proof could be found in [Mas07], Lemma 4.7.
Lemma 7.2 (Varshamov-Gilbert).
Let be equipped with Hamming distance . Given , there exists some subset of with the following properties:
We deduce a discretization of the set of covariance matrices for high-dimensional models.
If , let as constructed in Lemma 7.2, and its equivalent for adjacency matrices. Let , and let
Proof. As , we use the Varshamov-Gilbert’s lemma. With , , and by arguments similar to what we did before, it leads to the Proposition 2. ∎
7.2. Oracle inequality: proof of Theorem 3.1
Our oracle type inequality, Theorem 3.1, is deduced from a general model selection theorem for MLE. We state this theorem and its proof in Section 7.2.1, which is a generalization of Theorem 7.11 in [Mas07] for considering a random subcollection of a collection of models. Remark that this theorem could be useful in other applications, and is valid for other collections of models.
7.2.1. Model selection theorem for MLE among a random sub-collection
Let be a deterministic collection at most countable of models. We introduce the entropy with bracketing of the set . Recall that it is defined, for every positive , as the logarithm of the minimal number of brackets with diameter not larger than which are needed to cover and is denoted by .
In order to avoid measurability problems, we shall consider the following separability condition on the models.
For every , there exists some countable subset of and a set dense in such that for every , there exists some sequence of elements of such that for every , tends to as tends to infinity.
Let be an unknown density to be estimated from a sample of size . Consider some at most countable deterministic collection of models where for each , the elements of are assumed to be probability densities and fulfills Assumption 1. Let be some family of nonnegative numbers such that:
We assume that for every , is integrable in .
Moreover, for every , we assume that there exists on such that is nondecreasing, is nonincreasing on , and for every , for every , denoting by ,
Let be the unique positive solution of the equation .
Let , and for every , let such that:
Introduce some random sub-collection of .
Let . We consider the collection of -maximum likelihood estimators . Let , and let . We consider the minimizer of the penalized criterion
Then, there exists some absolute constants and such that wherever
for every , some random variable such that
exists and moreover, whatever the true density ,
This theorem is a generalization of Theorem in [Mas07] to a random subcollection of the whole collection of models. As the proof is adapted from the proof of this theorem, we detail here only differences and we refer the interested reader to [Mas07].
We denote by the empirical process and by the centered empirical process. Following the proof of the Massart’s theorem, easy computations lead to:
for and .
To bound , we use Massart’s arguments. The main difference stands in the control of . As is a random subcollection of models, . Nevertheless, thanks to Bernstein inequality, which we may use thanks to the inequality (17), we obtain that, for all , with probability smaller than ,
This extension has already been obtained by [MMR12]. We remark that this is a theoretically easy extension, but quite useful in practice, e.g. for controlling large collection of models.
We want to construct a family of nonnegative numbers such that
We need first to control the cardinal of , called the Bell number. Recall that is the set of all possible partitions of the variables. For this, we use a result of [BT10], which guarantees the following inequality.
Then, we obtain the following result, which defines the weights needed in Assumption (15).
Let . Then, .
Let . Remark that . Then, .
Let . Let : it could be written . Let and . According to Corollary 1, there exists