Heterogeneity is often natural in many contemporary applications involving massive data. While posing new challenges to effective learning, it can play a crucial role in powering meaningful scientific discoveries through the understanding of important differences among subpopulations of interest. In this paper, we exploit multiple networks with Gaussian graphs to encode the connectivity patterns of a large number of features on the subpopulations. To uncover the heterogeneity of these structures across subpopulations, we suggest a new framework of tuning-free heterogeneity pursuit (THP) via large-scale inference, where the number of networks is allowed to diverge. In particular, two new tests, the chi-based test and the linear functional-based test, are introduced and their asymptotic null distributions are established. Under mild regularity conditions, we establish that both tests are optimal in achieving the testable region boundary and the sample size requirement for the latter test is minimal. Both theoretical guarantees and the tuning-free feature stem from efficient multiple-network estimation by our newly suggested approach of heterogeneous group square-root Lasso (HGSL) for high-dimensional multi-response regression with heterogeneous noises. To solve this convex program, we further introduce a tuning-free algorithm that is scalable and enjoys provable convergence to the global optimum. Both computational and theoretical advantages of our procedure are elucidated through simulation and real data examples.
Tuning-Free Heterogeneity Pursuit ]Tuning-free heterogeneity pursuit in massive networks ††thanks: This work was supported by NSF CAREER Awards DMS-0955316 and DMS-1150318 and a grant from the Simons Foundation. Part of this work was completed while the last two authors visited the Departments of Statistics at University of California, Berkeley and Stanford University. These authors sincerely thank both departments for their hospitality.
Keywords: Heterogeneous learning; Large-scale inference; Multiple networks; Scalability; Heterogeneous group square-root Lasso; Efficiency; Sparsity; High dimensionality
In the era of data deluge one can easily collect a massive amount of data from multiple sources, each of which may come from a certain subpopulation of a larger population of interest. For example, these subpopulations can represent different cancer types, brain disorders, or product choices. A large number of features are often associated with each subject. Understanding the heterogeneity in the association structures of these features across subpopulations can be important in empowering meaningful scientific discoveries or effective personalized choices in our lives. Meanwhile heterogeneity in the data also poses new challenges to effective learning and calls for new developments of methods, theory, and algorithms with scalability and statistical efficiency.
Heterogeneity can take different forms in various applications such as the differences among the sparsity patterns or link strengths over multiple networks, and the differences in noise levels or distributions over multiple subpopulations. To avoid potential ambiguity, we would like to make it explicit that throughout this paper, we focus only on two particular types of heterogeneity which are the heterogeneity in sparsity patterns and the heterogeneity in noise levels. To approach the problem of heterogeneous learning in these contexts, we exploit the model setting of multiple networks with Gaussian graphs each of which encodes the connectivity pattern among features for each subpopulation. The edges of these networks are characterized by the inverse covariances for each pair of nodes from a subpopulation. The focus on this particular type of network models enables us to present our main idea with technical brevity. See, for example, Teng (2016) for an account of more general network models beyond graphical models. In fact, as a popular choice of network models Gaussian graphical models involving the inverse covariances have been used widely in applications to characterize the conditional dependency structure among variables. In such models, the joint distribution of variables is modeled by a multivariate Gaussian distribution , where the matrix is called the precision matrix or inverse covariance matrix of these variables. A basic fact is that each pair of variables, and , are conditionally independent given all other variables if and only if the th entry of the precision matrix is zero. The conditional dependency structure in a Gaussian graph is therefore determined completely by the associated precision matrix . See, for instance, Lauritzen (1996) and Wainwright and Jordan (2008) for more detailed accounts and applications of these models.
There is a growing literature on Gaussian graphical models. Much recent attention has been given to the problem of support recovery and link strength estimation, which focuses on identifying the nonzero entries of the precision matrix and estimating their strengths. Among those endeavors, a majority of the work has focused primarily on the case of a single Gaussian graphical model; see, for example, Meinshausen and Bühlmann (2006); Yuan and Lin (2007); Friedman et al. (2008); Fan et al. (2009); Yuan (2010); Cai et al. (2011); Ravikumar et al. (2011); Liu (2013); Zhang and Zou (2014); Ren et al. (2015); Fan and Lv (2015), among many others. A common feature of this line of work is that the data is assumed to be homogeneous with all observations coming from a single population. More detailed discussions and comparisons of these methods can be found in, for instance, Ren et al. (2015) and Fan and Lv (2015). Yet as mentioned before heterogeneity in the data can be prevalent in many contemporary applications involving massive data. The existing methods for analyzing data from each individual source become insufficient due to the assumption of homogeneity. Naively combining the results from these individual analyses may also yield suboptimal performance of statistical estimation and inference.
The setting of multiple networks with Gaussian graphical models has gained more recent attention. A lot of work assumes a time-varying graphical structure across different graphs. In particular, one assumes that there is a natural ordering of the graphs and the parameters of interest vary smoothly according to this order. For these developments, some smoothing techniques such as the kernel smoothing are key to the construction of the estimators as well as the analysis of their theoretical properties. While the time-varying graphical model is not the focus of our current paper, one may refer to, for example, Zhou et al. (2010); Kolar et al. (2010); Chen et al. (2013); Qiu et al. (2016); Lu et al. (2015) for more details on this line of work.
In contrast, our setting of multiple networks with Gaussian graphical models is along another line that makes no assumption on the ordering of the graphs. In this line of work, the main assumption is a common sparsity structure across different graphs. In particular, the estimators proposed in Guo et al. (2011), Danaher et al. (2014), and Zhu et al. (2014) employ the approach of penalized likelihood with different choices of the penalty function, while the MPE method introduced in Cai et al. (2016) takes a weighted constrained and minimization approach, which can be seen as an extension of the CLIME estimator for a single graph (Cai et al., 2011). A common feature of such existing work is the focus on the problem of support recovery and link strength estimation. Moreover, by the nature of these methods their computational cost increases drastically with both the dimensionality and the number of graphs, which can limit their practical use in analyzing massive data sets. How to develop a scalable procedure for large-scale inference in the setting of multiple Gaussian graphical models still remains largely open.
To uncover the heterogeneity of the connectivity patterns among features across subpopulations and address the aforementioned challenges, in this paper we suggest a new framework of tuning-free heterogeneity pursuit (THP) via large-scale inference, where the number of networks is allowed to diverge and the number of features can grow exponentially with the number of observations. Distinct from the existing methods, our procedure identifies the heterogeneity in sparsity patterns among a diverging number of graphs by testing the common sparsity structure of these Gaussian graphs. Specifically, we are interested in testing the null hypothesis
associated with the joint link strength vector for each pair of variables with , where with denotes the precision matrix associated with the th graph. To approach the inference problem in (1), we propose two new tests, named the chi-based test and the linear functional-based test, for two different scenarios. The former test which is for the general scenario requires no extra information from the graphs and is shown to perform well as long as the norm of the joint link strength vector is large. The chi-based test is named after the property that the null distribution of this test statistic is shown to converge to the chi distribution. The latter one relies on some extra information on the signs of . Such extra information is indeed available in some applications. For example, in some genome-wide association studies (GWAS) it was discovered that the association structures can be portable between certain subpopulations (Marigorta and Navarro, 2013). In such scenario, the linear functional-based test can be constructed and shown to perform well when the norm of the vector becomes large.
An interesting feature of both tests is that each of them is established under mild regularity conditions to be optimal in the sense of achieving the testable region boundary, where the testable region boundary is defined as the smallest signal strength below which no test is able to detect if the observations are from the null hypothesis against the alternative hypothesis and above which some test can distinguish successfully between the two hypotheses. We further show that for the linear functional-based test, the sample size requirement is in fact minimal. A natural question is whether naively combining the tests constructed from individual graphs might suffice. Our theoretical results provide insights into this question and demonstrate the advantages of our tests in terms of weaker sample size requirement than the naive combination approach. We also would like to mention that although the main focus of our paper is on hypothesis testing, our procedure can be modified easily by introducing an additional thresholding step for support recovery; see Section 2.5 for detailed discussions and comparisons with existing approaches. In particular, our modified procedure achieves successful support recovery under milder sample size assumption than many existing methods. To the best of our knowledge, the testing of multiple networks with graphs and the optimality study are both new to the literature.
The challenges of heterogeneous learning in the setting of multiple networks are rooted on the inference with efficiency, the scalability, and the selection of tuning parameters which is often an implicit bottleneck of existing methods. Our THP framework addresses all these challenges in a harmonious fashion. Both theoretical guarantees and the tuning-free feature are enabled through efficient multiple-network estimation by our newly suggested approach of heterogeneous group square-root Lasso (HGSL) in the setting of high-dimensional multi-response regression with heterogeneous noises. More specifically, we reduce the problem of estimating graphs simultaneously to that of running multi-response regressions with heterogeneous noises. This new formulation allows us to borrow information across graphs when estimating their structures, which results in improved rates of convergence. To solve the convex programs from these multi-response regressions, we introduce a new tuning-free algorithm that is scalable and admits provable convergence to the global optimum. Compared to existing methods in the literature, our new procedure enjoys four main advantages. First, it is justified theoretically that our HGSL estimators have faster rates of convergence. Second, the HGSL method is capable of handling heterogeneous noises, the presence of which causes intrinsic difficulty for developing a tuning-free procedure. Third, our new algorithm is simple and tuning free, and scales up easily. Fourth, we provide theoretical justification on the convergence of the tuning-free algorithm.
The rest of the paper is organized as follows. Section 2 introduces the THP framework for heterogeneous learning in multiple networks via large-scale inference with the chi-based test and the linear functional-based test, and establishes their optimality properties. We present the HGSL approach along with a tuning-free algorithm for fitting high-dimensional multi-response regression with heterogeneous noises, and provide the estimation and prediction bounds for the estimator as well as a convergence analysis for the algorithm in Section 3. Section 4 details several numerical examples of simulation studies and real data analysis. We discuss some extensions of the suggested method to a few settings in Section 5. The proofs of all the results and technical details are provided in the Supplementary Material.
2 Tuning-free heterogeneity pursuit in multiple networks via large-scale inference
2.1 Model setting
As mentioned in the Introduction, we adopt the setting of multiple networks with Gaussian graphical models to encode the connectivity patterns among features measured on subpopulations of a general population, which yields classes of data. In this model, for each class the -dimensional feature vector follows a multivariate Gaussian distribution
where the superscript means that these features are measured on the th subpopulation and is the precision matrix of the th class. In addition, the distributions of are assumed to be independent. Each of the precision matrices reflects the conditional dependency structure among the features . In the high-dimensional setting where the dimensionality can be very large compared to the sample size, it is common in many applications such as genomic studies to assume that each precision matrix has certain sparsity structure. The goals in these studies include the estimation of precision matrices and the inference on their entries .
When there is only one class of data, that is, , our setting coincides with that of single Gaussian graphical model. For the general case of multiple graphs with , it can be beneficial to borrow the strength across all classes of data to achieve more accurate estimation of the precision matrices if the classes are related to each other. With this spirit, we assume that the classes share some similar sparsity structure, and the heterogeneity in sparsity patterns captures the differences among these graphical structures. In particular, we are interested in the scenario where for each pair of nodes with , either simultaneously for all or alternatively the joint link strength vector is significantly different from the zero vector. Throughout the paper we denote by
the edge set corresponding to the graphs given in model (2.1).
The main goal of our paper is to develop an effective and efficient procedure for testing the null hypothesis defined in (1) for multiple networks, which provides an inferential approach to uncovering the heterogeneity of the feature association structures across the subpopulations. Depending on the type of the alternative hypothesis, we will introduce two different fully data-driven test statistics and establish their advantages over those obtained by naively combining the tests constructed from each individual graph.
2.2 Chi-based test
We begin with introducing the first test for our THP framework in multiple networks. To ease the presentation, we introduce some compact notation. Denote by the subvector of a vector with the th component removed, and for any matrix denote by its th column, the subvector of with the th component removed, and the submatrix of with the th column removed. Our testing idea is based on a simple observation that for each , the conditional distribution of given all remaining variables in class follows the Gaussian distribution
with the -dimensional coefficient vector . Based on the distributional representation in (2.2), one can see that the error random variables are independent across and follow the distribution . Moreover, it holds for each pair of nodes with that
The key representation in (2.2) entails that accurate estimators of with can be constructed if one can estimate , , and well.
Another important observation is that under the null hypothesis in (1), the conditional distributions of the classes with indeed share similar sparsity structure on the coefficient vectors thanks to the representation . In fact, it is clear that for all under , where is the component of vector corresponding to variable . This observation suggests that we can borrow information from different graphs when testing the joint sparsity structure of multiple graphs. Motivated by such observation, we turn the problem of multiple-network estimation into that of high-dimensional multi-response linear regression
for . A distinct feature of the above multi-response regression model (2.2) is that it has heterogeneous noises since generally varies over .
As mentioned before, we also have the group sparsity structure of the regression coefficient vector in model (2.2). More specifically, denote the -dimensional subvector of corresponding to the th group by
Then we see that for all pairs , the complement of defined in (2.1). We will suggest in Section 3 an efficient estimation procedure that utilizes the group sparsity structure in the regression coefficients and also accounts for the heterogeneity in the noises in model (2.2).
From now on we work with a sample from model (2.1) that is comprised of independent and identically distributed (i.i.d.) observations for each class , where and the observations across different classes are independent. Suppose that we have some initial estimator for the -dimensional regression coefficient vector , whose construction is detailed in Section 3. Then the random errors for each can be estimated by the residuals
with and . In view of the representation in (2.2), we can estimate associated with the noise level of class as . In contrast, the estimation of with is slightly more complicated. To estimate the negative covariance , we exploit the following bias corrected statistic
Observe that the first term on the right-hand side of (2.2) corresponds to the sample covariance of the residuals from variables and . When , this sample covariance is asymptotically unbiased in estimating . Such sample covariance is, however, biased in the case of and thus two additional terms are introduced for in (2.2) to correct the bias. Indeed, we can show that after the bias correction the statistic is asymptotically close to the statistic
which is in turn asymptotically close to the negative covariance .
When there is only a single graph, that is, , the above statistic in (2.2) reduces to the one introduced in Liu (2013) to address the bias issue in the testing for a single Gaussian graph. In the scenario of multiple graphs, we observe a similar phenomenon and provide in Theorem 1 later a formal theoretical justification. It is worth mentioning that the key estimators and introduced above are constructed using the residuals instead of the estimated regression coefficients , though the regression coefficients are also closely related to the entries of the precision matrix . The main advantage of using residuals over coefficients is rooted on the fact that obtaining asymptotically unbiased estimates of the latter is much more challenging in high dimensions, largely due to the well-known bias issue associated with the regularization methods, than accurately estimating the former, which is closely related to the prediction problem.
The new formulation in (2.2) not only allows us to solve the problem of multiple-graph estimation efficiently through multi-response regressions as detailed in Section 3, but also enables us to construct new tests that are more powerful than existing methods by borrowing information from different graphs. We are now ready to present the first such test. Due to the group sparsity structure and the target of our null hypothesis in (1), we naturally construct our test statistics using certain functions of all statistics in (2.2) with . Thanks to the joint estimation accuracy for the -dimensional regression coefficient vector , we define our first test statistic, the chi-based test statistic , as
for testing the null hypothesis against the alternative hypothesis for which the condition is imposed on the norm . In other words, our test statistic is powerful whenever the signal strength is larger than some testable region boundary, which will be characterized later in Section 2.4.
To characterize the limiting distribution of the chi-based test statistic in (2.2) under the null, we introduce two additional statistics and as
where is the random error and is the oracle estimator of since the random error vector is unobservable in practice. It is interesting to observe that under the null, the Gaussian vector is independent of , which entails that and they are independent of each other over . Consequently, under the null hypothesis in (1) it holds that .
Before formally presenting our first main result, we introduce the following two regularity conditions on our model (2.1).
There exists some constant such that for each , where and denote the smallest and largest eigenvalues of a matrix.
It holds that with , where means the same order, , and is some positive constant.
The well-conditionedness of the precision matrices assumed in Condition 1 simplifies our technical presentation. For simplicity, we also assume in Condition 2 that our sample is balanced with the sample sizes of each of the classes comparable to each other. With slight abuse of notation, we denote by this common level whenever the rate is involved. We proceed with introducing additional notation and technical conditions. Denote by and the estimation errors of and , respectively, with the -dimensional subvector defined in a similar way to in (2.2). To characterize the sparsity level, we define the joint sparsity of the networks as the maximum node degree corresponding to the edge set in (2.1),
We further assume that with high probability the initial estimator satisfies
Indeed, the new tuning-free approach of HGSL suggested in Section 3 guarantees that we can obtain initial estimators each satisfying all these properties (2.2)–(2.2) with probability at least for some positive constants and . A distinct feature is that the analysis of our tuning-free estimator is new due to the heterogeneity of noises across different classes, which makes typical tuning-free procedures such as the scaled Lasso (Sun and Zhang, 2012) and the square-root Lasso (Belloni et al., 2011) no longer work in the current setting; see Section 3 for more detailed discussions.
Assume that Conditions 1–2 hold, the initial estimators each satisfy properties (2.2)–(2.2) with probability at least , , and for some constants and . Then for each pair with , it holds with probability at least that
where is some constant. Moreover, under null hypothesis in (1) we have and with the same probability bound that
The coupling result in Theorem 1 motivates us to propose the chi-based test defined as
for our THP framework in multiple networks which tests the null hypothesis in (1) using the test statistic given in (2.2), where is a fixed significance level and denotes the th percentile of the chi distribution with degrees of freedom . The name of this test is from the property that the null distribution of the test statistic is asymptotically close to the chi distribution.
As formally justified in Proposition 1, the chi-based test introduced in ( ‣ 2.2) is indeed an asymptotic test with significance level under the sample size requirement of , in the asymptotic setting in which the number of nodes , the number of networks , and the joint sparsity of the networks can diverge simultaneously as the common level of sample sizes .
2.3 Linear functional-based test
The chi-based test introduced in Section 2.2 serves as a general procedure to test whether the joint link strength vector is zero when there is no additional information assumed on the networks. In some scenarios when certain extra knowledge is available, it is possible to design more powerful testing procedures. In this spirit, we now present an alternative test for our THP framework in multiple networks based on a linear functional of , which is closely related to its norm. The main motivation is that in some applications such as the GWAS example mentioned in the Introduction (Marigorta and Navarro, 2013), the sign relationship of some target edge across graphs is provided implicitly or explicitly. For example, one may expect that all the with share the same sign, that is, they are either all nonpositive or all nonnegative. In such scenario, testing the null hypothesis is equivalent to testing . In a more general setting, the sign relationship can be represented by a unique sign vector , up to a single sign, such that or , and thus the null hypothesis takes an equivalent form of .
Given the above sign vector , we define our second test statistic, the linear functional-based test statistic , as
with the bias corrected statistic given in (2.2). To characterize the limiting distribution of the linear functional-based test statistic under the null, we introduce another statistic as
where the statistic is given in (2.2). With the extra sign information, our new test statistic is powerful whenever the signal strength becomes large; see Section 2.4 for the characterization of the testable region boundary under the alternative hypothesis for which the condition is imposed on the norm . It is easy to see that under the null, are independent of each other over , and consequently for any given sign vector .
Theorem 2 quantifies the asymptotic behavior of the linear functional-based test statistic under the null hypothesis in (1). Assume further that the sign vector is given uniquely such that under the alternative hypothesis. Then Theorem 2 and the definition of the statistic in (2.2) motivate us to propose a one-sided test, the linear functional-based test , defined as
for our THP framework in multiple networks, where is a fixed significance level and stands for the th percentile of the standard Gaussian distribution. When the sign vector is given up to a single sign, for example, when we know only that all the signs with are identical, it is more natural to define a two-sided test. We omit the details of such two-sided test for simplicity.
Proposition 2 which is based on Theorem 2 shows that the linear functional-based test introduced in ( ‣ 2.3) is indeed an asymptotic test with significance level under the sample size requirement of . It is worth mentioning that most existing results in the literature either focus on testing procedures for a single graph or develop estimation procedures for multiple graphs without statistical inference in high dimensions. In contrast, our developments in Theorems 1–2 and Propositions 1–2 provide procedures of large-scale inference in multiple graphs for the first time. For the case of a single graph with , our test statistics essentially reduce to the one introduced in Liu (2013). This suggests an alternative way of constructing test statistics, which is to construct a test statistic for each individual graph as in Liu (2013) and then naively pool them together in the same way as for our tests and .
Let us gain some insights into our tests with a comparison to the above naive combination procedure. The advantage of our linear functional-based test is reflected on the sample size requirement of established in Proposition 2, thanks to the information of structural similarity across the graphs which makes the working assumptions (2.2)–(2.2) possible. In comparison, to test the null hypothesis one can also apply the procedure in Liu (2013) to each of the graphs and then construct a similar linear functional-based test as in ( ‣ 2.3). For such naive combination procedure, it can be shown that a stronger sample size assumption is required. In fact, we further establish in Section 2.4 that the sample size requirement for our linear functional-based test is minimal in a decision theoretic framework.
Similarly the advantage of our chi-based test is rooted on the sample size requirement of obtained in Proposition 1. In contrast, one can also construct a similar chi-based test as in ( ‣ 2.2) based on the residuals which are obtained through an application of the procedure in Liu (2013) to each individual graph. For such naive combination testing procedure, it can be shown that the sample size assumption is required. This demonstrates that in a range of typical scenarios when the number of networks does not grow excessively fast with , our chi-based test indeed has a weaker sample size requirement.
2.4 Optimality of tests and minimum sample size requirement
So far we have introduced our THP framework in multiple networks with two different types of tests for testing the null hypothesis in (1). The constructions of our test statistics are motivated by the possible alternative hypothesis. In particular, the chi-based test should be powerful as long as the joint link strength is away from zero, while the linear functional-based test will be powerful when the signs of are known and becomes large. Along this direction, we now further investigate two types of composite alternative hypotheses. We define the set of all -sparse multiple networks as
where stands for the set of precision matrices with slight abuse of notation and is some positive integer. Then the null hypothesis in (1) can be rewritten as
In particular, we consider the following two alternative hypotheses
where the former is introduced to investigate the chi-based test , the latter is for the linear functional-based test , and .
It is clear that the difficulty of testing the null in ( ‣ 2.4) against the alternative in ( ‣ 2.4) or against the alternative in ( ‣ 2.4) depends critically on the quantity . The smaller is, the more difficult to distinguish between the null and alternative hypotheses. A natural and fundamental question is what the boundary of the testable region is. Such a boundary means that it is impossible to detect whether the observations are from the null against the alternative as long as is smaller than it, while there exists some test which can distinguish between the two hypotheses whenever is far larger than it.
To characterize the testable region boundary, we introduce the separating rate of null against alternative or . For any fixed significance level and power , the separating rate for alternative or is said to be if there exist some test of asymptotic significance level and some absolute large constant such that
while there exists some absolute small constant such that for any test of asymptotic significance level , it holds that