Large Covariance Estimation for Compositional Data via Composition-Adjusted Thresholding
High-dimensional compositional data arise naturally in many applications such as metagenomic data analysis. The observed data lie in a high-dimensional simplex, and conventional statistical methods often fail to produce sensible results due to the unit-sum constraint. In this article, we address the problem of covariance estimation for high-dimensional compositional data, and introduce a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. We rigorously characterize the identifiability of the covariance parameters, derive rates of convergence under the spectral norm, and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some naive thresholding estimators that ignore the unique features of compositional data. We apply the proposed method to the analysis of a microbiome dataset in order to understand the dependence structure among bacterial taxa in the human gut.
Key words: Adaptive thresholding; Basis covariance; Centered log-ratio covariance; High dimensionality; Microbiome; Regularization.
Compositional data, which represent the proportions or fractions of a whole, arise naturally in a wide range of applications; examples include geochemical compositions of rocks, household patterns of expenditures, species compositions of biological communities, and topic compositions of documents, among many others. This article is particularly motivated by the metagenomic analysis of microbiome data. The human microbiome is the totality of all microbes at various body sites, whose importance in human health and disease has increasingly been recognized. Recent studies have revealed that microbiome composition varies based on diet, health, and the environment , and may play a key role in complex diseases such as obesity, atherosclerosis, and Crohn’s disease .
With the development of next-generation sequencing technologies, it is now possible to survey the microbiome composition using direct DNA sequencing of either marker genes or the whole metagenomes. After aligning these sequence reads to the reference microbial genomes, one can quantify the relative abundances of microbial taxa. These sequencing-based microbiome studies, however, only provide a relative, rather than absolute, measure of the abundances of community components. The counts comprising these data (e.g., 16S rRNA gene reads or shotgun metagenomic reads) are set by the amount of genetic material extracted from the community or the sequencing depth, and analysis typically begins by normalizing the observed data by the total number of counts. The resulting fractions thus fall into a class of high-dimensional compositional data that we focus in this article. The high dimensionality refers to the fact that the number of taxa may be comparable to or much larger than the sample size.
An important question in metagenomic studies is to understand the co-occurrence and co-exclusion relationship between microbial taxa, which would provide valuable insights into the complex ecology of microbial communities . Standard correlation analysis from the raw proportions, however, can lead to spurious results due to the unit-sum constraint; the proportions tend to be correlated even if the absolute abundances are independent. Such undesired effects should be removed in an analysis in order to make valid inferences about the underlying biological processes. The compositional effects are further magnified by the low diversity of microbiome data, that is, a few taxa make up the overwhelming majority of the microbiome .
Let be a composition of components (taxa) satisfying the simplex constraint
Owing to the difficulties arising from the simplex constraint, it has been a long-standing question how to appropriately model, estimate, and interpret the covariance structure of compositional data. The pioneering work of  introduced several equivalent matrix specifications of compositional covariance structures via the log-ratios of components. Statistical methods based on these covariance models respect the unique features of compositional data and prove useful in a variety of applications such as geochemical analysis. A potential disadvantage of these models, however, is that they lack a direct interpretation in the usual sense of covariances and correlations; as a result, it is unclear how to impose certain structures such as sparsity in high dimensions, which is crucial for our applications to microbiome data analysis.
Covariance matrix estimation is of fundamental importance in high-dimensional data analysis and has attracted much recent interest. It is well known that the sample covariance matrix performs poorly in high dimensions and regularization is thus indispensable.  and  introduced regularized estimators by hard thresholding for large covariance matrices that satisfy certain notions of sparsity.  considered a more general class of thresholding functions, and  proposed adaptive thresholding that adapts to the variability of individual entries. Exploiting a factor model structure,  proposed a factor-based method for high-dimensional covariance matrix estimation.  extended the work by considering a conditional sparsity structure and developed a POET method by thresholding principal orthogonal complements.
In this article, we address the problem of covariance estimation for high-dimensional compositional data. Let with for all be a vector of latent variables, called the basis, that generate the observed data via the normalization
Estimating the covariance structure of has traditionally been considered infeasible owing to the apparent lack of identifiability. By exploring a decomposition relating the compositional covariance to the basis covariance, we find, however, that the nonidentifiability vanishes asymptotically as the dimensionality grows under certain sparsity assumptions. More specifically, define the basis covariance matrix by
where . Then is approximately identifiable as long as it belongs to a class of large sparse covariance matrices.
The somewhat surprising “blessing of dimensionality” allows us to develop a simple, two-step method by first extracting a rank-2 component from the decomposition and then estimating the sparse component by thresholding the residual matrix. The resulting procedure can equivalently be viewed as thresholding the sample centered log-ratio covariance matrix, and hence is optimization-free and scalable for large covariance matrices. We call our method composition-adjusted thresholding (COAT), which removes the “coat” of compositional effects from the covariance structure. We derive rates of convergence under the spectral norm and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some naive thresholding estimators that ignore the unique features of compositional data. We illustrate our method by analyzing a microbiome dataset in order to understand the dependence structure among bacterial taxa in the human gut.
The covariance relationship, which was due to , has recently been exploited to develop algorithms for inferring correlation networks from metagenomic data . Our contributions here are to turn the idea into a principled approach to sparse covariance matrix estimation and provide statistical insights into the issue of identifiability and the impacts of dimensionality. Our method also bears some resemblance to the POET method proposed by  in that underlying both methods is a low-rank plus sparse matrix decomposition. The rank-2 component in our method, however, arises from the covariance structure of compositional data rather than a factor model assumption. As a result, it can be obtained by simple algebraic operations without computing the principal components.
The rest of the article is organized as follows. Section 2 reviews a covariance relationship and addresses the issue of identifiability. Section 3 introduces the COAT methodology. Section 4 investigates the theoretical properties of the COAT estimator in terms of convergence rates and support recovery. Simulation studies and an application to human gut microbiome data are presented in Sections 5 and 6, respectively. We conclude the article with some discussion in Section 7 and relegate all proofs to the Appendix.
2Identifiability of the Covariance Model
We first introduce some notation. Denote by , , , and the matrix -norm, spectral norm, Frobenius norm, and entrywise -norm, defined for a matrix by , , , and , where denotes the largest eigenvalue.
In the latent variable covariance model and , the basis covariance matrix is the parameter of interest. One of the matrix specifications of compositional covariance structures introduced by  is the variation matrix defined by
In view of the relationship , we can decompose as
or in matrix form,
where and . Corresponding to the many-to-one relationship between bases and compositions, the basis covariance matrix is unidentifiable from the decomposition , since and are in general not orthogonal to each other (with respect to the usual Euclidean inner product). In fact, using the centered log-ratio covariance matrix defined by
where is the geometric mean of a vector , we can similarly write
or in matrix form,
where and . Unlike , the following proposition shows that is an orthogonal decomposition and hence the components and are identifiable. In addition, by comparing the decompositions and , we can bound the difference between and its identifiable counterpart as follows.
Proposition ? entails that the covariance parameter is approximately identifiable as long as . In particular, suppose that belongs to a class of sparse covariance matrices considered by ,
where and denotes that is positive definite. Then
and hence the parameters and are asymptotically indistinguishable when . This allows us to use as a proxy for and greatly facilitates the development of new methodology and associated theory. The intuition behind the approximate identifiability under the sparsity assumption is that the rank-2 component represents a global effect that spreads across all rows and columns, while the sparse component represents a local effect that is confined to individual entries.
Also of interest is the exact identifiability of over -balls, which has been studied by  and . The following result provides a sufficient and necessary condition for the exact identifiability of by confining it to an -ball.
A counterexample is provided in the proof of Proposition ? to show that the sparsity conditions in  and , which are both at the order of , do not suffice. The identifiability condition in Proposition ? essentially requires the average degree of the correlation network to be less than 1, which is too restrictive to be useful in practice. This illustrates the importance and necessity of introducing the notion of approximate identifiability.
3A Sparse Covariance Estimator for Compositional Data
Suppose that , , are independent copies of , where the compositions are observed and the bases are latent. In Section 3.1, we rely on the decompositions and and Proposition ? to develop an estimator of , and in Section 3.2 discuss the selection of the tuning parameter.
In view of Proposition ?, we wish to estimate the covariance parameter via the proxy . To this end, we first construct an empirical estimate of and then apply adaptive thresholding to the estimate.
There are two equivalent ways to form the estimate of . Motivated by the decomposition , one can start with the sample counterpart of defined by
where and . A rank-2 component with can be extracted from the decomposition by projecting onto the subspace , which is given by
where and . The residual matrix , with entries
is then an estimate of . Alternatively, can be obtained directly as the sample counterpart of through the expression
where and .
Now applying adaptive thresholding to , we define the composition-adjusted thresholding (COAT) estimator
where is a general thresholding function and are entry-dependent thresholds.
In this article, we consider a class of general thresholding functions that satisfy the following conditions:
These two conditions were assumed by  and  along with another condition that is not required in our analysis. Examples of thresholding functions belonging to this class include the hard thresholding rule , the soft thresholding rule , and the adaptive lasso rule for .
The performance of the COAT estimator depends critically on the choice of thresholds. Using entry-adaptive thresholds may in general improve the performance over applying a universal threshold. To derive a data-driven choice of , define
where . We take to be of the form
where are estimates of , and is a tuning parameter to be chosen, for example, by cross-validation. We rewrite as , where . Then can be estimated by
3.2Tuning Parameter Selection
The thresholds defined by depend on the tuning parameter , which can be chosen through -fold cross-validation. Denote by the COAT estimate based on the training data excluding the th fold, and the residual matrix (or the sample centered log-ratio covariance matrix) based on the test data including only the th fold. We choose the optimal value of that minimizes the cross-validation error
With the optimal , we then compute the COAT estimate based on the full dataset as our final estimate. When the positive definiteness of the covariance estimate in finite samples is required for interpretation, we follow the approach of  and choose in the range where the minimum eigenvalue of the COAT estimate is positive.
In this section, we investigate the asymptotic properties of the COAT estimator. As a distinguishing feature of our theoretical analysis, we assume neither the exact identifiability of the parameters nor that the degree of (approximate) identifiability is dominated by the statistical error. Instead, the degree of identifiability enters our analysis and shows up in the resulting rate of convergence. Such theoretical analysis is rare in the literature, but is extremely relevant for latent variable models in the presence of nonidentifiability and is of theoretical interest in its own right. We introduce our assumptions in Section 4.1, and present our main results on rates of convergence and support recovery in Section 4.2.
Recall that , , and , and define . Without loss of generality, assume for all throughout this section. We need to impose the following moment conditions on the log-basis .
Conditions 1–3 are similar to those commonly assumed in the covariance estimation literature; see, for example, . Condition ? requires that the variables s be uniformly sub-Gaussian; the definition we use here is among several equivalent ways of defining sub-Gaussianity , and is most convenient for our technical analysis. Condition ? imposes some restrictions on the dimensionality and sparsity of the basis covariance matrix . It is worth mentioning that the sparsity level condition is so weak that it suffices to guarantee only approximate identifiability but allows the degree of nonidentifiability to be large relative to the statistical error. Condition ? is essential for methods based on adaptive thresholding. Condition ? arises from identifiability considerations in estimating the variances . In particular, if is multivariate normal, then Condition ? is implied by the assumptions and in Condition ?, since from Isserlis’ theorem  we have
We are now in a position to state our main results. The following theorem gives the rate of convergence under the spectral norm for the COAT estimator.
The rate of convergence provided by Theorem ? exhibits an interesting decomposition: the term represents the estimation error due to estimating , while the term accounts for the approximation error due to using as a proxy for . In particular, if the approximation error is dominated by the estimation error, then the COAT estimator attains the minimax optimal rate under the spectral norm over . It is important to note that the dimensionality appears in both terms where it plays opposite roles. We observe a “curse of dimensionality” in the first term, where the growth of dimensionality contributes a logarithmic factor to the estimation error. In contrast, a “blessing of dimensionality” is reflected by the second term in that a diverging dimensionality shrinks the approximation error toward zero at a power rate.
The insights gained from Theorem ? have important implications for compositional data analysis. In the analysis of many compositional datasets, the dimensionality often depends on the taxonomic level to be examined. For example, in metagenomic studies, the dimensionality may range from only a few taxa at the phylum level to thousands of taxa at the operational taxonomic unit (OTU) level. Suppose, for simplicity, that the magnitudes of correlation signals are of about the same order across different taxonomic levels. Then Theorem ? indicates a tradeoff between an accurate estimation of the covariance structure with low dimensionality and a sensible interpretation in terms of the basis components with high dimensionality. This tradeoff thus suggests the need to analyze compositional data at relatively finer taxonomic levels when a latent variable interpretation is desired.
The proof of Theorem ? relies on a series of concentration inequalities that take the approximation error term into account, which can be found in the Appendix. As a consequence of these inequalities, we obtain the following result regarding the support recovery property of the COAT estimator. Here the support of refers to the set of all indices with .
Theorem ? parallels the support recovery results in  and . However, owing to the extra term in the expression of , the assumption requires in addition that no correlation signals fall below the approximation error. In other words, exact support recovery will break down if any correlation signal is confounded by the compositional effect.
We conducted simulation studies to compare the numerical performance of the COAT estimator with that of the oracle thresholding estimator , which knew the latent basis components and applied the thresholding procedure to the sample covariance matrix of the log-basis . We also include in our comparison two naive thresholding estimators and , which are based on the sample covariance matrices of the composition and its logarithm , respectively. Note that is the ideal estimator that the COAT estimator attempts to mimic, whereas both and ignore the unique features of compositional data and thus are expected to perform poorly.
The data , , were generated as follows. We first generated in two different ways:
Then and were obtained through the transformations and , . Hence, in Case (i), and follow multivariate log-normal and logistic normal distributions , respectively; the distributions of and in Case (ii) can similarly be viewed as a type of multivariate log-gamma and logistic-gamma distributions.
In both cases, we took the components of randomly from the uniform distribution on , in order to reflect the fact that compositional data arising from metagenomic studies are often heterogeneous. The following two models for the covariance matrix were considered:
Model 1 is an extreme but illustrative case intended for comparing the distributions of spurious correlations under different transformations. The setting of Model 2 is typical in the covariance estimation literature and similar to that in . We set the sample size and the dimension , 100, and 200, and repeated 100 simulations for each setting.
The boxplots of sample correlations with simulated data under different transformations in Model 1 are shown in Figure 1. Clearly, the sample centered log-ratio (clr) correlations are centered around zero and have a similar distribution to that of the sample correlations of ; the resemblance tends to increase as the dimension grows. This trend is consistent with Proposition ? and provides numerical evidence for the validity of the centered log-ratio covariance matrix as a proxy for . In fact, from the proof of Proposition ? we have, when ,
In contrast, the phenomenon of spurious correlations is observed on both and . The sample correlations of exhibit a severe upward bias, while the sample correlations of contain many outliers that would be detected as signals by a thresholding procedure with threshold level close to 1. Moreover, the spurious correlations seem to become worse with gamma-related distributions where the components of the composition have more heterogeneous means.
We applied the COAT method with hard and soft thresholding rules to simulated data in Model 2. For comparison, we also applied the thresholding procedure to the sample covariance matrices of , , and , resulting in the estimators , , and , respectively. The tuning parameter in each thresholding estimator was chosen by tenfold cross-validation. Losses under the matrix -norm, spectral norm, and Frobenius norm were used to measure the estimation performance, while the true positive rate and false positive rate were employed to assess the quality of support recovery.
The simulation results for Model 2 with normal- and gamma-related distributions are summarized in Tables ? and ?, respectively. We see that the COAT estimator performs almost equally well as the ideal estimator , and outperforms the naive thresholding estimators and by a large margin. In particular, the estimation losses of are disastrously large in the gamma setting, in agreement with the severe bias observed in Figure 1. The estimation losses of do not change much across different thresholding rules and distributions, since all entries of the estimate are very small relative to the true values. Both and show inferior performance in terms of true and false positive rates, indicating that they are not model selection consistent. Comparisons between hard and soft thresholding rules suggest that the former is more conservative in selecting false positives and results in a more parsimonious model, whereas the latter strikes a balance between true and false positives due to the shrinkage effect.
To further compare the support recovery performance without selecting a threshold level, we plot the receiver operating characteristic (ROC) curves for all methods in Figure 2. Note that hard and soft thresholding rules lead to the same ROC curve for each method. We observe that the ROC curves for and are almost indistinguishable and uniformly dominate those for and , demonstrating the superiority of the COAT method. Of the two naive thresholding estimators, tends to outperform when the threshold level is high, since the former is less influenced by the high spurious correlations as reflected in Figure 1.
6Gut Microbiome Data Analysis
The gut microbiome plays a critical role in energy extraction from the diet and interacts with the immune system to exert a profound influence on human health and disease. Despite an emerging interest in characterizing the ecology of human-associated microbial communities, the complex interactions among microbial taxa remain poorly understood . We now illustrate the proposed method by applying it to a human gut microbiome dataset described by , which was collected from a cross-sectional study of 98 healthy individuals at the University of Pennsylvania. DNA from stool samples of these subjects were analyzed by 454/Roche pyrosequencing of 16S rRNA gene segments, resulting in an average of 9265 reads per sample, with a standard deviation of 3864. Taxonomic assignment yielded 3068 operational taxonomic units, which were further combined into 87 genera that appeared in at least one sample. Demographic information, including body mass index (BMI), was also collected from the subjects. We are interested in identifying and comparing the correlation structures among bacterial genera between lean and obese subjects. We therefore divided the dataset into a lean group (, ) and an obese group (, ), and focused on the bacterial genera that appeared in at least four samples in each group. The count data were transformed into compositions after zero counts were replaced by 0.5.
We applied the COAT method with the soft thresholding rule to each group, and used tenfold cross-validation to select the tuning parameter. The resulting estimate was represented by a correlation network among the bacterial genera with each edge representing a nonzero correlation. To assess the stability of support recovery, we further generated 100 bootstrap samples for each group and repeated the thresholding procedure on each sample. The stability of the correlation network was measured by the average proportion of edges reproduced by each bootstrap replicate. Finally, we retained only the edges in the correlation network that were reproduced in at least 80 bootstrap replicates. The numbers of positive and negative correlations and the stability of correlation networks are reported in Table 1; the results for the two naive thresholding estimators and are also included for comparison. We see that the COAT method achieves the highest stability among the three methods and has the most edges passing the stability test. The correlation network identified by has substantially fewer negative correlations than the other two methods, which is likely due to the severe upward bias observed in Figure 1. The correlation network identified by is the least stable.
The correlation networks identified by the COAT method for the two groups are displayed in Figure ?. Clearly, the networks for the lean and obese groups show markedly different architecture, indicating that the obese microbiome is less modular with less complex interactions between the modules. This phenomenon has been demonstrated by previous studies and is possibly due to adaptation of the microbiome to low-diversity environments . Table 1 and Figure ? also suggest that the gut microbial network tends to contain more competitive (negative) interactions than cooperative (positive) ones, which seems consistent with the recent finding that the ecological stability of the gut microbiome can be attributed to the benefits from limiting positive feedbacks and dampening cooperative networks .
A closer inspection of the correlation networks identifies Bacteroides and Prevotella as two key genera of the gut microbiome. The abundances of these two genera are well known to distinguish two gut microbial enterotypes, which are strongly associated with long-term dietary patterns . The negative correlations between Bacteroides and Prevotella ( in the lean group and in the obese group) are well explained by the diet-dependent enterotypes and the within-body separation of the two genera . Moreover, recent studies have suggested several keystone species belonging to the genus Bacteroides, through which the structure of gut microbial communities may be influenced by small perturbations . Also, the Firmicutes-enriched microbiome has been found to hold greater metabolic potential than the Bacteroidetes-enriched microbiome for more efficient energy harvest from the diet . Figure ? seems to support these findings, in view of the central position of Bacteroides in the networks and its strong correlations with a few genera belonging to the Firmicutes. Such patterns, however, are less clearly seen in the correlation networks identified by the other two methods.
Understanding the dependence structure among microbial taxa within a community, including co-occurrence and co-exclusion relationships between microbial taxa, is an important problem in microbiome research. Such structures provide biological insights into the community dynamics and factors that change the community structures. To overcome the difficulties arising from the unit-sum constraint of the observed compositional data, we have developed a COAT method to estimate the sparse covariance matrix of the latent log-basis components. Our method is based on a decomposition of the variation matrix into a rank-2 component and a sparse component. The resulting procedure is equivalent to thresholding the sample centered log-ratio covariance matrix, and thus is optimization-free and scalable for high-dimensional data.
Our simulation results demonstrate that the COAT method performs almost as well as the oracle thresholding estimator that knew the latent basis components, and outperforms some naive thresholding estimators by a large margin. These improvements are more pronounced when the basis components have a skewed distribution, as is often observed in microbiome studies. In the application to gut microbiome data, the COAT method leads to more stable and biologically more interpretable results for comparing the dependence structures of lean and obese microbiomes.
We have provided conditions for the approximate and exact identifiability of the covariance parameters, and have established rates of convergence and support recovery guarantees for the COAT estimator. The rate of convergence includes an extra term of in addition to the usual minimax optimal rate of convergence for sparse covariance estimation. The extra term represents an approximation error due to using as a proxy for , which vanishes under mild assumptions as the dimensionality increases.
The proposed methodology may be extended in several ways. First, it would be possible to develop a joint optimization procedure based on the decomposition . For example, one may consider the regularized estimator
where and is a sparsity-inducing penalty function. The COAT estimator can be viewed as a one-step approximation to with appropriately chosen penalty function and initial value . Solving the full optimization problem is computationally more expensive but is expected to improve on the performance of the COAT estimator. Another worthwhile extension would be to deal with zero counts directly. One may, in principle, combine the ideas presented here with models that account for sampling and structural zeros. The issues of identifiability and computational feasibility are the major concerns with such extensions.
a.1Proof of Proposition
Using the fact that the centered log-ratio covariance matrix is symmetric and has all zero row sums , we have
that is, the components and are orthogonal to each other.
To show the desired inequality, by the identity (4.35) of , we have
a.2Proof of Proposition
We first claim that if , then the matrix has at least nonzero upper-triangular entries. To prove this, without loss of generality, assume and that the last entries of the first row of are zero, where ; that is, for , and . The latter implies , which gives rise to nonzero entries at positions with . Putting these pieces together, we obtain that the number of nonzero upper-triangular entries in is at least
To show that the lower bound is not attainable, note that if there are only nonzero upper-triangular entries, then or 2, and we have , which implies . Since , this gives rise to at least one nonzero entry at positions with , which is a contradiction.
Now suppose and that and in lead to , that is,
Note that the right-hand side has fewer than nonzero upper-triangular entries. Then it follows from the above claim that .
We prove the other direction by showing that, if , then there exist and in with that lead to . Indeed, let
where , , and . Then it is easy to verify that
This completes the proof.
To prepare for the proofs of Theorems 1 and 2, we first establish some useful concentration inequalities. For notational simplicity, the constants below may vary from line to line.
To prove , let and . Note first that, by Condition ? and the sub-Gaussian tail bound, for any and ,
which is less than if we choose sufficiently large. Then we have
By Hoeffding’s inequality and the union bound,
Also, by Condition 1 and the sub-Gaussian tail bound,
Combining both terms, choosing with , and noting , we arrive at
for some . This proves and completes the proof.
We first prove . Define
where . We then write
Note that, by definition, , where . Define , where . Since are uniformly sub-Gaussian by Condition ?, are also uniformly sub-Gaussian. Using a truncation argument similar to that for proving , we can show that
for some . The sub-Gaussian tails imply also that . Combining these two pieces yields
It follows from Lemma ? that
The above two inequalities together imply
We can similarly bound the other terms in and obtain
To bound the term , we further write
Consider the event on which
Then, on , we have
To bound the next term in , we write
which, on and by Condition ?, is bounded by . We can similarly bound the other terms in and obtain, on ,
To bound the term , note that