Practical Methods for Graph Two-Sample Testing
Abstract
Hypothesis testing for graphs has been an important tool in applied research fields for more than two decades, and still remains a challenging problem as one often needs to draw inference from few replicates of large graphs. Recent studies in statistics and learning theory have provided some theoretical insights about such high-dimensional graph testing problems, but the practicality of the developed theoretical methods remains an open question.
In this paper, we consider the problem of two-sample testing of large graphs. We demonstrate the practical merits and limitations of existing theoretical tests and their bootstrapped variants. We also propose two new tests based on asymptotic distributions. We show that these tests are computationally less expensive and, in some cases, more reliable than the existing methods.
Debarghya Ghoshdastidar, Department of Computer Science, University of Tübingen, ghoshdas@informatik.uni-tuebingen.de
Ulrike von Luxburg, Department of Computer Science, University of Tübingen, and Max Planck Institute for Intelligent Systems, luxburg@informatik.uni-tuebingen.de
Preprint. Work in progress.
1 Introduction
Hypothesis testing is one of the most commonly encountered statistical problems that naturally arises in nearly all scientific disciplines. With the widespread use of networks in bioinformatics, social sciences and other fields since the turn of the century, it was obvious that the hypothesis testing of graphs would soon become a key statistical tool in studies based on network analysis. The problem of testing for differences in networks arises quite naturally in various situations. For instance, Bassett et al. (2008) study the differences in anatomical brain networks of schizophrenic patients and healthy individuals, whereas Zhang et al. (2009) test for statistically significant topological changes in gene regulatory networks arising from two different treatments of breast cancer. As Clarke et al. (2008) and Hyduke et al. (2013) point out, the statistical challenge associated with network testing is the curse of dimensionality as one needs to test large graphs based on few independent samples. Ginestet et al. (2014) show that complications can also arise due to the widespread use of multiple testing principles that rely on performing independent tests for every edge.
Although network analysis has been a primary research topic in statistics and machine learning, theoretical developments related to testing random graphs have been rather limited until recent times. Property testing of graphs has been well studied in computer science (Goldreich et al., 1998), but probably the earliest instances of the theory of random graph testing are the works on community detection, which use hypothesis testing to detect if a network has planted communities or to determine the number of communities in a block model (Arias-Castro and Verzelen, 2014; Bickel and Sarkar, 2016; Lei, 2016). In the present work, we are interested in the more general and practically important problem of two-sample testing: Given two populations of random graphs, decide whether both populations are generated from the same distribution or not. While there have been machine learning approaches to quantify similarities between graphs for the purpose of classification, clustering etc. (Borgwardt et al., 2005; Shervashidze et al., 2011), the use of graph distances for the purpose of hypothesis testing is more recent (Ginestet et al., 2017). Most approaches for graph testing based on classical two-sample tests are applicable in the relatively low-dimensional setting, where the population size (number of graphs) is larger than the size of the graphs (number of vertices). However, Hyduke et al. (2013) note that this scenario does not always apply because the number of samples could be potentially much smaller; for instance, one may need to test between two large regulatory networks (that is, population size is one). Such scenarios can be better tackled from a perspective of high-dimensional statistics as shown in Tang et al. (2016); Ghoshdastidar et al. (2017a), where the authors study two-sample testing for specific classes of random graphs with particular focus on the small population size.
In this work, we focus on the framework of the graph two-sample problem considered in Tang et al. (2016); Ginestet et al. (2017); Ghoshdastidar et al. (2017a), where all graphs are defined on a common set of vertices. Assume that the number of vertices in each graph is $n$, and the sample size of either population is $m$. One can consider the two-sample problem in three different regimes: (i) $m$ is large; (ii) $m > 1$, but much smaller than $n$; and (iii) $m = 1$. The first setting is the simplest one, and practical tests are known in this case (Gretton et al., 2012; Ginestet et al., 2017). However, there exist many application domains where already the availability of only a small population of graphs is a challenge, and large populations are completely out of bounds. The latter two cases of small $m > 1$ and $m = 1$ have been studied in Ghoshdastidar et al. (2017a) and Tang et al. (2016), where theoretical tests based on concentration inequalities have been developed and practical bootstrapped variants of the tests have been suggested. The contribution of the present work is threefold:

For the cases of $m > 1$ and $m = 1$, we propose new tests that are based on asymptotic null distributions under certain model assumptions and we prove their statistical consistency (Sections 4 and 5 respectively). The proposed tests are devoid of bootstrapping, and hence, computationally faster than existing bootstrapped tests for small $m$. Detailed descriptions of the tests are provided in Appendix B.

Our aim is also to make the existing and proposed tests more accessible for applied research. We also provide Matlab implementations of the tests.
The present work is focused on the assumption that all networks are defined over the same set of vertices. This may seem restrictive in some application areas, but it is commonly encountered in other areas such as brain network analysis or molecular interaction networks, where vertices correspond to well-defined regions of the brain or protein structures. Few works study the case where graphs do not have vertex correspondences in the context of clustering (Mukherjee et al., 2017) and testing (Ghoshdastidar et al., 2017b; Tang et al., 2017). However, theoretical guarantees are only known for specific choices of network functions (triangle counts or graph spectra), or under the assumption of an underlying embedding of the vertices.
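Notation. We use the asymptotic notation $o_n(\cdot)$ and $\omega_n(\cdot)$, where the asymptotics are with respect to the number of vertices $n$. We say $x = o_n(y)$ and $y = \omega_n(x)$ when $\lim_{n \to \infty} x/y = 0$. We denote the matrix Frobenius norm by $\|\cdot\|_F$ and the spectral norm or largest singular value by $\|\cdot\|_2$.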
2 Problem Statement
We consider the following framework of two-sample setting. Let $V$ be a set of $n$ vertices. Let $G_1, \ldots, G_m$ and $H_1, \ldots, H_m$ be two populations of undirected unweighted graphs defined on the common vertex set $V$, where each population consists of $m$ independent and identically distributed samples. The two-sample hypothesis testing problem is as follows:
Test whether $(G_i)_{i=1,\ldots,m}$ and $(H_i)_{i=1,\ldots,m}$ are generated from the same random model or not.
There exist a plethora of nonparametric tests that are provably consistent for $m \to \infty$. In particular, kernel based tests (Gretton et al., 2012) are known to be suitable for two-sample problems in large dimensions. These tests, in conjunction with graph kernels (Shervashidze et al., 2011; Kondor and Pan, 2016) or distances (Mukherjee et al., 2017), may be used to derive consistent procedures for testing between two large populations of graphs. Such principles are applicable even under a more general framework without vertex correspondence (see Gretton et al., 2012). However, given graphs on a common vertex set, the most natural approach is to construct tests based on the graph adjacency matrix or the graph Laplacian (Ginestet et al., 2017). To be precise, one may view each undirected graph on $n$ vertices as a $\binom{n}{2}$-dimensional vector and use classical two-sample tests based on the $\chi^2$ or $T^2$ statistics (Anderson, 1984). Unfortunately, such tests require an estimate of the $\binom{n}{2} \times \binom{n}{2}$-dimensional sample covariance matrix, which cannot be accurately obtained from a moderate sample size. For instance, Ginestet et al. (2017) need regularisation of the covariance estimate even for moderate sized problems , and it is unknown whether such methods work for brain networks obtained from a single-lab experimental setup (). For , it is indeed hard to prove consistency results under the general two-sample framework described above since the correlation among the edges can be arbitrary. Hence, we develop our theory for random graphs with independent edges. Tang et al. (2016) show that tests derived for such graphs are also useful in practice.
We assume that the graphs are generated from the inhomogeneous Erdős-Rényi (IER) model (Bollobas et al., 2007). This model has been considered in the work of Ghoshdastidar et al. (2017a) and subsumes other models studied in the context of graph testing such as dot product graphs (Tang et al., 2016) and stochastic block models (Lei, 2016). Given a symmetric matrix $P \in [0,1]^{n \times n}$ with zero diagonal, a graph $G$ is said to be an IER graph with population adjacency $P$, denoted as $G \sim \mathrm{IER}(P)$, if its symmetric adjacency matrix $A_G \in \{0,1\}^{n \times n}$ satisfies: $(A_G)_{ij} \sim \mathrm{Bernoulli}(P_{ij})$ for all $i < j$, and the entries $\{(A_G)_{ij} : i < j\}$ are mutually independent.
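To make the model concrete, the following is a minimal Python sketch of drawing one graph with independent edges from a given matrix of edge probabilities. It is an illustration only and not the Matlab implementation accompanying this paper; the function name `sample_ier` and the list-of-lists matrix representation are assumptions made for this example.

```python
import random


def sample_ier(P, rng=None):
    """Sample a symmetric 0/1 adjacency matrix with independent edges.

    P is a symmetric list-of-lists of edge probabilities (hypothetical
    representation); its diagonal is ignored, so the graph has no self-loops.
    """
    rng = rng or random.Random()
    n = len(P)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each edge {i, j} is an independent Bernoulli(P[i][j]) draw.
            if rng.random() < P[i][j]:
                A[i][j] = A[j][i] = 1
    return A
```

Drawing the upper triangle only and mirroring it keeps the adjacency matrix symmetric, which matches the undirected setting considered here.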
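For any $n$ and $m$, we state the two-sample problem as follows. Let $P^{(n)}, Q^{(n)} \in [0,1]^{n \times n}$ be two symmetric matrices. Given $G_1, \ldots, G_m \sim \mathrm{IER}(P^{(n)})$ and $H_1, \ldots, H_m \sim \mathrm{IER}(Q^{(n)})$, test the hypotheses
$\mathcal{H}_0 : P^{(n)} = Q^{(n)}$ against $\mathcal{H}_1 : P^{(n)} \neq Q^{(n)}$. (1)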
Our theoretical results in subsequent sections will often be in the asymptotic case as $n \to \infty$. For this, we assume that there are two sequences of models $(P^{(n)})_{n \geq 1}$ and $(Q^{(n)})_{n \geq 1}$, and the sequences are identical under the null hypothesis $\mathcal{H}_0$. We derive asymptotic powers of the proposed tests assuming certain separation rates under the alternative hypothesis.
3 Testing large population of graphs
Before proceeding to the case of small population size, we discuss a baseline approach that is designed for the large $m$ regime ($m \to \infty$). The following discussion provides a $\chi^2$-type test statistic for networks, which is a simplification of Ginestet et al. (2017) under the IER assumption. Given the adjacency matrices $A_{G_1}, \ldots, A_{G_m}$ and $A_{H_1}, \ldots, A_{H_m}$, consider the test statistic
(2) 
where . It is easy to see that, under the null hypothesis, the statistic in (2) converges in distribution to a $\chi^2$ law as $m \to \infty$ for any fixed $n$. This suggests a $\chi^2$-type test similar to Ginestet et al. (2017). However, like any classical test, no performance guarantee can be given for small $m$, and our numerical results show that such a test is powerless for small $m$ and sparse graphs. Hence, in the rest of the paper, we consider tests that are powerful even for small $m$.
4 Testing small populations of large graphs
The case of small $m$ for IER graphs was first studied from a theoretical perspective in Ghoshdastidar et al. (2017a), and the authors also show that, under a minimax testing framework, the testing problem is quite different for $m = 1$ and $m > 1$. From a practical perspective, small $m$ is a common situation in neural imaging with only few subjects. The case of $m = 2$ is also interesting for testing between two individuals based on test-retest diffusion MRI data, where two scans are collected from each subject with a separation of multiple weeks (Landman et al., 2011).
Under the assumption of IER models described in Section 2 and given the adjacency matrices $A_{G_1}, \ldots, A_{G_m}$ and $A_{H_1}, \ldots, A_{H_m}$, Ghoshdastidar et al. (2017a) propose test statistics based on estimates of the distances $\|P - Q\|_2$ and $\|P - Q\|_F$ up to certain normalisation factors that account for sparsity of the graphs. They consider the following two test statistics
(3)  
(4) 
Subsequently, theoretical tests are constructed based on concentration inequalities: one can show that with high probability, the test statistics are smaller than some specified threshold under the null hypothesis, but they exceed the same threshold if the separation between $P$ and $Q$ is large enough. In practice, however, the authors note that the theoretical thresholds are too large to be exceeded for moderate $n$, and recommend estimation of the threshold through bootstrapping. Each bootstrap sample is generated by randomly partitioning the entire population into two parts, and the statistics in (3) or (4) are computed based on this random partition. This procedure provides an approximation of the statistic under the null model. We refer to these tests as BootSpectral and BootFrobenius, and show their limitations for small $m$ via simulations. Detailed descriptions of these tests are included in Appendix B.
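The random-partition bootstrap described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the procedure of Appendix B: the name `bootstrap_p_value`, the generic `statistic` callable, and the add-one correction in the returned p-value are choices made for this example.

```python
import random


def bootstrap_p_value(pop_g, pop_h, statistic, num_boot, rng=None):
    """Approximate the null distribution of `statistic` by random re-partitioning.

    pop_g, pop_h : the two populations (sequences of equal length m)
    statistic    : hypothetical callable taking two populations, returning a float
    num_boot     : number of bootstrap samples
    """
    rng = rng or random.Random()
    observed = statistic(pop_g, pop_h)
    pooled = list(pop_g) + list(pop_h)
    m = len(pop_g)
    exceed = 0
    for _ in range(num_boot):
        # Randomly split the pooled population into two parts of size m each.
        rng.shuffle(pooled)
        if statistic(pooled[:m], pooled[m:]) >= observed:
            exceed += 1
    # Add-one correction (an assumption of this sketch) keeps the p-value positive.
    return (exceed + 1) / (num_boot + 1)
```

The null hypothesis would be rejected when the returned value falls below the chosen significance level. Note that the statistic is evaluated once per bootstrap sample, which is the source of the computational cost discussed in Remark 2.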
We now propose a test based on the asymptotic behaviour of the statistic in (4) as $n \to \infty$. We state the asymptotic behaviour in the following result.
Theorem 1 (Asymptotic test based on the statistic in (4)).
In the twosample framework of Section 2, assume have entries bounded away from 1, and satisfy .
Under the null hypothesis, is dominated by a standard normal random variable, and hence, for any ,
(5) 
where is the upper quantile of the standard normal distribution.
On the other hand, if , then
(6) 
The proof, given in Appendix A, is based on the use of the Berry-Esseen theorem (Berry, 1941). Using Theorem 1, we propose an $\alpha$-level test based on asymptotic normal dominance of the statistic in (4).
Proposed Test AsympNormal: Reject the null hypothesis if .
A detailed description of this test is given in Appendix B. The assumption is not restrictive since it is quite similar to assuming that the number of edges is superlinear in $n$, that is, the graphs are not too sparse. We note that unlike the $\chi^2$-type test of Section 3, here the asymptotics are for $n \to \infty$ instead of $m \to \infty$, and hence, the behaviour under null hypothesis may not improve for larger $m$. The asymptotic unit power of the AsympNormal test, as shown in Theorem 1, is proved under a separation condition, which is not surprising since we have access to only a finite number of graphs. The result also shows that for large $m$, smaller separations can be detected by the proposed test.
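As an illustration of how a test of this kind could be run, the Python sketch below computes a Frobenius-type statistic and compares it with a standard normal quantile. Since equation (4) and the exact rejection rule are not reproduced in this text, the form used here is an assumption: each population is split into two halves, the numerator sums the products of the half-wise differences over vertex pairs, the denominator is the square root of the sum of the products of the half-wise sums, and the null is rejected when the absolute value of the statistic exceeds the upper $\alpha/2$ normal quantile. The function names are hypothetical, and the sketch needs $m \geq 2$.

```python
from math import sqrt
from statistics import NormalDist


def frobenius_statistic(adj_g, adj_h):
    """Assumed form of the statistic: adj_g, adj_h are lists of m adjacency matrices."""
    m, n = len(adj_g), len(adj_g[0])
    half = m // 2
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Differences and sums of edge indicators over each half of the samples.
            d1 = sum(adj_g[k][i][j] - adj_h[k][i][j] for k in range(half))
            d2 = sum(adj_g[k][i][j] - adj_h[k][i][j] for k in range(half, m))
            s1 = sum(adj_g[k][i][j] + adj_h[k][i][j] for k in range(half))
            s2 = sum(adj_g[k][i][j] + adj_h[k][i][j] for k in range(half, m))
            num += d1 * d2
            den += s1 * s2
    return num / sqrt(den) if den > 0 else 0.0


def asymp_normal_test(adj_g, adj_h, alpha=0.05):
    """Return (statistic, two-sided normal p-value, reject flag)."""
    t = frobenius_statistic(adj_g, adj_h)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    # p_value < alpha is equivalent to |t| exceeding the upper alpha/2 quantile.
    return t, p_value, p_value < alpha
```

The statistic is computed in a single pass over the vertex pairs, with no resampling, which reflects the computational advantage over the bootstrap variants.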
Remark 2 (Computational effort).
Note that the computational complexity for computing the test statistics in (3) and (4) is linear in the total number of edges in the entire population. However, the bootstrap tests require computation of the test statistic multiple times (equal to the number of bootstrap samples ; we use in our experiments). On the other hand, the proposed test computes the statistic once, and is much faster (200 times). Moreover, if the graphs are too large to be stored in memory, bootstrapping requires multiple passes over the data, while the proposed test requires only a single pass.
5 Testing difference between two large graphs
The case of $m = 1$ is perhaps the most interesting from a theoretical perspective: the objective is to detect whether two large graphs $G$ and $H$ are identically distributed or not. This finds application in detecting differences in regulatory networks (Zhang et al., 2009) or comparing brain networks of individuals (Tang et al., 2016). Although the concentration based test using the statistic in (3) is applicable even for $m = 1$ (Ghoshdastidar et al., 2017a), bootstrapping based on label permutation is infeasible for $m = 1$ since there is no scope of permuting labels with unit population size. Tang et al. (2016), however, propose a concentration based test in this case and suggest a bootstrapping based on low rank assumption of the population adjacency. Tang et al. (2016) study the two-sample problem for random dot product graphs, which are IER graphs with low rank population adjacency matrices (ignoring the effect of zero diagonal). This class includes the stochastic block model, where the rank equals the number of communities. Let $G \sim \mathrm{IER}(P)$ and $H \sim \mathrm{IER}(Q)$, and assume that $P$ and $Q$ are of rank $r$. One defines the adjacency spectral embedding (ASE) of graph $G$ as $X_G = U_G \Sigma_G^{1/2}$, where $\Sigma_G \in \mathbb{R}^{r \times r}$ is a diagonal matrix containing the $r$ largest singular values of $A_G$ and $U_G \in \mathbb{R}^{n \times r}$ is the matrix of corresponding left singular vectors. Tang et al. (2016) propose the test statistic
(7) 
where the rank $r$ is assumed to be known. The rotation matrix in (7) aligns the ASE of the two graphs. Tang et al. (2016) theoretically analyse a concentration based test, where the null hypothesis is rejected if the statistic in (7) crosses a suitably chosen threshold. In practice, they suggest the following bootstrapping to determine the threshold (Algorithm 1 in Tang et al., 2016). One may approximate $P$ by the estimated population adjacency (EPA) $\hat{P} = X_G X_G^{\top}$, where $X_G$ is the ASE of $G$. More random dot product graphs can be simulated from $\hat{P}$, and a bootstrapped threshold can be obtained by computing the statistic in (7) for pairs of graphs generated from $\hat{P}$. Instead of the statistic in (7), one may also use a statistic based on EPA as
(8) 
This statistic has been used as a distance measure in the context of graph clustering (Mukherjee et al., 2017). We refer to the tests based on the statistics in (7) and (8), together with the above bootstrapping procedure, as BootASE and BootEPA (see Appendix B for detailed descriptions). We find that the latter performs better, but both tests work under the condition that the population adjacency is of low rank, and the rank is precisely known. Our numerical results demonstrate the limitations of these tests when the rank is not correctly known.
Alternatively, we propose a test based on the asymptotic distribution of eigenvalues that is not restricted to graphs with low rank population adjacencies. Given $G \sim \mathrm{IER}(P)$ and $H \sim \mathrm{IER}(Q)$, consider the matrix $C \in \mathbb{R}^{n \times n}$ with zero diagonal and for $i \neq j$,
(9) 
We assume that the entries of $P$ and $Q$ are not arbitrarily close to 1, and define $C_{ij} = 0$ when $P_{ij} = Q_{ij} = 0$. We show that the extreme eigenvalues of $C$ asymptotically follow the Tracy-Widom law, which characterises the distribution of the largest eigenvalues of matrices with independent standard normal entries (Tracy and Widom, 1996). Subsequently, we show that $\|C\|_2$ is a useful test statistic.
Theorem 3 (Asymptotic test based on $C$).
Consider the above setting of two-sample testing, and let $C$ be as defined in (9). Let $\lambda_1(C)$ and $\lambda_n(C)$ be the largest and smallest eigenvalues of $C$.
Under the null hypothesis, that is, if for all , then
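$n^{2/3}(\lambda_1(C) - 2) \to TW_1$ and $n^{2/3}(-\lambda_n(C) - 2) \to TW_1$ in distribution as $n \to \infty$, where $TW_1$ is the Tracy-Widom law for orthogonal ensembles. Hence,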
(10) 
for any , where is the upper quantile of the distribution.
On the other hand, if and are such that , then
(11) 
The proof, given in Appendix A, relies on results on the spectrum of random matrices (Erdős et al., 2012; Lee and Yin, 2014), which have been previously used for the special case of determining the number of communities in a block model (Bickel and Sarkar, 2016; Lei, 2016). If the graphs are assumed to be block models, then asymptotic power can be proved under more precise conditions on the difference in population adjacencies (see Appendix A.3). From a practical perspective, $C$ cannot be computed since $P$ and $Q$ are unknown. Still, one may approximate them by relying on a weaker version of Szemerédi's regularity lemma, which implies that large graphs can be approximated by stochastic block models with a possibly large number of blocks (Lovász, 2012). To this end, we propose to estimate $P$ from $A_G$ as follows. We use a community detection algorithm, such as normalised spectral clustering (Ng et al., 2002), to find $r$ communities in $G$ ($r$ is a parameter for the test). Subsequently, $P$ is approximated by a block matrix $\hat{P}$ such that if $i, j$ lie in communities $V_1, V_2$ respectively, then $\hat{P}_{ij}$ is the mean of the submatrix of $A_G$ restricted to $V_1 \times V_2$. Similarly, one can also compute $\hat{Q}$ from $A_H$. Hence, we propose a Tracy-Widom test statistic as
(12)  
and the diagonal is zero. The proposed $\alpha$-level test based on the statistic in (12) and Theorem 3 is the following.
Proposed Test AsympTW: Reject the null hypothesis if .
A detailed description of the test, as used in our implementations, is given in Appendix B. We note that unlike bootstrap tests based on the statistics in (7) or (8), the proposed test uses the number of communities (or rank) $r$ only for approximation of $P$ and $Q$, and the power of the test is not sensitive to the choice of $r$. In addition, the computational benefit of a distribution based test over bootstrap tests, as noted in Remark 2, is also applicable in this case.
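The two ingredients of this test, the block approximation of the population adjacency and the spectral statistic, can be sketched in Python as follows. Several parts are assumptions, because equations (9) and (12) are not reproduced in this text: the community labels are taken as given (any community detection method could supply them), the scaled difference matrix is assumed to have entries $(A_G - A_H)_{ij} / \sqrt{(n-1)(\hat{P}_{ij}(1-\hat{P}_{ij}) + \hat{Q}_{ij}(1-\hat{Q}_{ij}))}$, and the statistic is assumed to be $n^{2/3}(\|\hat{C}\|_2 - 2)$. All function names are hypothetical.

```python
from math import sqrt


def block_mean_estimate(A, labels):
    """Block matrix whose (i, j) entry is the mean of A over the blocks of i and j."""
    groups = {}
    for v, c in enumerate(labels):
        groups.setdefault(c, []).append(v)
    mean = {}
    for c1, rows in groups.items():
        for c2, cols in groups.items():
            total = sum(A[i][j] for i in rows for j in cols)
            mean[(c1, c2)] = total / (len(rows) * len(cols))
    n = len(A)
    return [[mean[(labels[i], labels[j])] for j in range(n)] for i in range(n)]


def scaled_difference(A, B, P_hat, Q_hat):
    """Assumed scaling: difference of adjacencies divided by an estimated std. deviation."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            var = P_hat[i][j] * (1 - P_hat[i][j]) + Q_hat[i][j] * (1 - Q_hat[i][j])
            if i != j and var > 0:  # zero diagonal; entry stays 0 if variance is 0
                C[i][j] = (A[i][j] - B[i][j]) / sqrt((n - 1) * var)
    return C


def spectral_norm(C, iters=200):
    """Power iteration on C^2 for a symmetric matrix C (start vector is arbitrary)."""
    n = len(C)
    v = [1.0 / (k + 1) for k in range(n)]
    scale = sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    top = 0.0
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        u = [sum(C[i][j] * w[j] for j in range(n)) for i in range(n)]
        top = sqrt(sum(x * x for x in u))  # approaches the largest eigenvalue of C^2
        if top == 0.0:
            return 0.0
        v = [x / top for x in u]
    return sqrt(top)


def tracy_widom_statistic(A, B, P_hat, Q_hat):
    """Assumed statistic n^(2/3) * (||C||_2 - 2), to be compared with a TW quantile."""
    n = len(A)
    return n ** (2.0 / 3.0) * (spectral_norm(scaled_difference(A, B, P_hat, Q_hat)) - 2.0)
```

The sketch stops at the statistic: the Tracy-Widom quantile needed for the rejection threshold is not available in the Python standard library and would have to be supplied from a numerical table or a dedicated routine. The block means here are taken over whole submatrices, diagonal included, which is one literal reading of the description above.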
6 Numerical results
In this section, we empirically compare the merits and limitations of the tests discussed in the paper. We present our numerical results in three groups: (i) results for random graphs for $m > 1$, (ii) results for random graphs for $m = 1$, and (iii) results for testing real networks. For $m > 1$, we consider four tests. BootSpectral and BootFrobenius are the bootstrap tests based on (3) and (4), respectively. AsympChi2 is the $\chi^2$-type test based on (2), which is suited for the large $m$ setting, and finally, the proposed test AsympNormal is based on the normal dominance of the statistic in (4) as $n \to \infty$, as shown in Theorem 1. For $m = 1$, we consider three tests. BootASE and BootEPA are the bootstrap tests based on (7) and (8), respectively. AsympTW is the proposed test based on (12) and Theorem 3. Appendices B and C contain descriptions of all tests and additional numerical results. (Footnote 1: Matlab codes available at: https://github.com/gdebarghya/NetworkTwoSampleTesting)
6.1 Comparative study on random graphs for $m > 1$
For this study, we generate graphs from stochastic block models with 2 communities as considered in Tang et al. (2016). We define $P$ and $Q$ as follows. The vertex set of size $n$ is partitioned into two communities, each of size $n/2$. In $P$, edges occur independently with probability $p$ within each community, and with probability $q$ between two communities. $Q$ has the same block structure as $P$, but edges occur with probability $(p + \epsilon)$ within each community. Under the null hypothesis $\epsilon = 0$ and hence $Q = P$, whereas under the alternative hypothesis, we set $\epsilon > 0$.
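The two-community setup described above can be written down as a short Python sketch that builds the population adjacency matrices. The function name `two_block_sbm` and its parameter names are hypothetical, and the numeric values in the usage note are placeholders for illustration rather than the settings of the experiments.

```python
def two_block_sbm(n, p_within, q_between):
    """Population adjacency of a 2-community block model with equal-sized blocks.

    Vertices 0 .. n/2 - 1 form the first community, the rest the second.
    The diagonal is zero, as required for an IER population adjacency.
    """
    half = n // 2
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                same_block = (i < half) == (j < half)
                P[i][j] = p_within if same_block else q_between
    return P
```

For example, `P = two_block_sbm(100, p, q)` and `Q = two_block_sbm(100, p + eps, q)` give the pair of models, with `eps = 0` for the null hypothesis and `eps > 0` for the alternative; graphs can then be drawn with independent Bernoulli edges from each matrix.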
In our first experiment, we study the performance of different tests for varying $m$ and $n$. We let $n$ grow from 100 to 1000 in steps of 100, and set and . We set $\epsilon = 0$ and $0.04$ for null and alternative hypotheses, respectively. We use two values of population size, , and fix the significance level at . Figure 1 shows the rate of rejecting the null hypothesis (test power) computed from 1000 independent runs of the experiment. Under the null model, the test power should be smaller than the significance level, whereas under the alternative model, a high test power (close to 1) is desirable. We see that for , only AsympNormal has power while the bootstrap tests have zero rejection rate. This is not surprising as bootstrapping is impossible for . For , BootFrobenius has a behaviour similar to AsympNormal although the latter is computationally much faster. BootSpectral achieves a higher power for small $n$ but cannot achieve unit power. AsympChi2 has an erratic behaviour for small $m$, and hence, we study it for larger sample size in Figure 3 (in Appendix C). As is expected, AsympChi2 has desired performance only for .
We also study the effect of edge sparsity on the performance of the tests. For this, we consider the above setting, but scale the edge probabilities by a factor of $\rho$, where $\rho = 1$ is exactly the same as the above setting while larger $\rho$ corresponds to denser graphs. Figure 4 in the appendix shows the results in this case, where we fix and vary and . We again find that AsympNormal and BootFrobenius have similar trends for . All tests perform better for dense graphs, but BootSpectral may be preferred for sparse graphs when .
[Figure 1. Panels: under null hypothesis; under alternative hypothesis. Vertical axis: test power (null rejection rate). Horizontal axis: number of vertices.]
6.2 Comparative study on random graphs for $m = 1$
We conduct similar experiments for the case of $m = 1$. Recall that bootstrap tests for $m = 1$ work under the assumption that the population adjacencies are of low rank. This holds in the above considered setting of block models, where the rank is 2. We first demonstrate the effect of knowledge of true rank on the test power. We use $r$ to specify the rank parameter for bootstrap tests, and also as the number of blocks used for the community detection step of AsympTW. Figure 2 shows the power of the tests for the above setting with and growing $n$. We find that when $r = 2$, that is, true rank is known, both bootstrap tests perform well under the alternative hypothesis, and outperform AsympTW, although BootASE has a high type-I error rate. However, when an overestimate of rank is used , both bootstrap tests break down (BootEPA always rejects while BootASE always accepts), but the performance of AsympTW is robust to this parameter change.
We also study the effect of sparsity by varying the scaling factor of the edge probabilities (see Figure 5 in Appendix C). We only consider the case . We find that all tests perform better in the dense regime, and the rejection rate of AsympTW under null is below 5% even for small graphs. However, the performance of both BootASE and AsympTW is poor if the graphs are too sparse. Hence, BootEPA may be preferable for sparse graphs, but only if the rank is correctly known.
[Figure 2. Panels: under null hypothesis; under alternative hypothesis. Vertical axis: test power (null rejection rate). Horizontal axis: number of vertices.]
6.3 Qualitative results for testing real networks
We use the proposed asymptotic tests to analyse two real datasets. These experiments demonstrate that the proposed tests are applicable beyond the setting of IER graphs. In the first setup, we consider moderate sized graphs constructed by thresholding autocorrelation matrices of EEG recordings (Andrzejak et al., 2001; Dua and Taniskidou, 2017). The network construction is described in Appendix C.2. Each group of networks corresponds to either epileptic seizure activity or four other resting states. In Tables 1–4 in Appendix C, we report the test powers and p-values for AsympNormal and AsympTW. We find that, except for one pair of resting states, networks for different groups can be distinguished by both tests. Further observations and discussions are also provided in the appendix.
We also study networks corresponding to peering information of autonomous systems, that is, graphs defined on the routers comprising the Internet with the edges representing who-talks-to-whom (Leskovec et al., 2005; Leskovec and Krevl, 2014). The information for systems was collected once a week for nine consecutive weeks, and two networks are available for each date based on two sets of information . We run the AsympNormal test for every pair of dates and report the p-values in Table 5 (Appendix C.3). It is interesting to observe that as the interval between two dates increases, the p-values decrease at an exponential rate, that is, the networks differ drastically according to our tests. We also conduct semi-synthetic experiments by randomly perturbing the networks, and study the performance of AsympNormal and AsympTW as the perturbations increase (see Figures 6–7). Since the networks are large and sparse, we perform the community detection step of AsympTW using BigClam (Yang and Leskovec, 2013) instead of spectral clustering. We infer that the limitation of AsympTW in the sparse regime (observed in Figure 5) could possibly be caused by poor performance of standard spectral clustering in the sparse regime.
7 Concluding remarks
In this work, we consider the two-sample testing problem for undirected unweighted graphs defined on a common vertex set. This problem finds application in various domains, and is often challenging due to unavailability of a large number of samples (small $m$). We study the practicality of existing theoretical tests, and propose two new tests based on asymptotics for large graphs (Theorems 1 and 3). We perform numerical comparison of various tests, and also provide their Matlab implementations. In the $m > 1$ case, we find that BootSpectral is effective for , but AsympNormal is recommended for smaller $m$ since it is more reliable and requires less computation. For $m = 1$, we recommend AsympTW due to robustness to the rank parameter and computational advantage. For large sparse graphs, AsympTW should be used with a robust community detection step (BigClam).
One can certainly extend some of these tests to more general frameworks of graph testing. For instance, directed graphs can be tackled by modifying the statistic in (4) such that the summation is over all $i \neq j$, and Theorem 1 would hold even in this case. For weighted graphs, Theorem 3 can be used if one modifies (9) by normalising with variance of . Subsequently, these variances can be approximated again through block modelling. For $m > 1$, we believe that unequal population sizes can be handled by rescaling the matrices appropriately, but we have not verified this.
Acknowledgements
This work is supported by the German Research Foundation (Research Unit 1735) and the Institutional Strategy of the University of Tübingen (DFG, ZUK 63).
References
 Anderson [1984] T. W. Anderson. An introduction to multivariate statistical analysis. John Wiley and Sons, 1984.
 Andrzejak et al. [2001] R. G. Andrzejak, K. Lehnertz, C. Rieke, F. Mormann, P. David, and C. E. Elger. Indications of nonlinear deterministic and finite dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64:061907, 2001.
 Arias-Castro and Verzelen [2014] E. Arias-Castro and N. Verzelen. Community detection in dense random networks. Annals of Statistics, 42(3):940–969, 2014.
 Bassett et al. [2008] D. S. Bassett, E. Bullmore, B. A. Verchinski, V. S. Mattay, D. R. Weinberger, and A. Meyer-Lindenberg. Hierarchical organization of human cortical networks in health and schizophrenia. The Journal of Neuroscience, 28(37):9239–9248, 2008.
 Berry [1941] A. C. Berry. The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49(1):122–136, 1941.
 Bickel and Sarkar [2016] P. J. Bickel and P. Sarkar. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(1):253–273, 2016.
 Bollobas et al. [2007] B. Bollobas, S. Janson, and O. Riordan. The phase transition in inhomogeneous random graphs. Random Structures and Algorithms, 31(1):3–122, 2007.
 Borgwardt et al. [2005] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–56, 2005.
 Bornemann [2010] F. Bornemann. On the numerical evaluation of distributions in random matrix theory. Markov Processes and Related Fields, 16:803–866, 2010.
 Clarke et al. [2008] R. Clarke, H. W. Ressom, A. Wang, J. Xuan, M. C. Liu, E. A. Gehan, and Y. Wang. The properties of high-dimensional data spaces: Implications for exploring gene and protein expression data. Nature Reviews Cancer, 8:37–49, 2008.
 Dua and Taniskidou [2017] D. Dua and K. Taniskidou. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017.
 Erdős et al. [2012] L. Erdős, H.-T. Yau, and J. Yin. Rigidity of eigenvalues of generalized Wigner matrices. Advances in Mathematics, 229(3):1435–1515, 2012.
 Ghoshdastidar et al. [2017a] D. Ghoshdastidar, M. Gutzeit, A. Carpentier, and U. von Luxburg. Two-sample hypothesis testing for inhomogeneous random graphs. arXiv preprint (arXiv:1707.00833), 2017a.
 Ghoshdastidar et al. [2017b] D. Ghoshdastidar, M. Gutzeit, A. Carpentier, and U. von Luxburg. Two-sample tests for large random graphs using network statistics. In Conference on Learning Theory (COLT), 2017b.
 Ginestet et al. [2014] C. E. Ginestet, A. P. Fournel, and A. Simmons. Statistical network analysis for functional MRI: Summary networks and group comparisons. Frontiers in computational neuroscience, 8(51):10.3389/fncom.2014.00051, 2014.
 Ginestet et al. [2017] C. E. Ginestet, J. Li, P. Balachandran, S. Rosenberg, and E. D. Kolaczyk. Hypothesis testing for network data in functional neuroimaging. The Annals of Applied Statistics, 11(2):725–750, 2017.
 Goldreich et al. [1998] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, 1998.
 Gretton et al. [2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
 Hyduke et al. [2013] D. R. Hyduke, N. E. Lewis, and B. Palsson. Analysis of omics data with genome-scale models of metabolism. Molecular BioSystems, 9(2):167–174, 2013.
 Kondor and Pan [2016] R. Kondor and H. Pan. The multiscale Laplacian graph kernel. In Advances in Neural Information Processing Systems (NIPS), 2016.
 Landman et al. [2011] B. A. Landman, A. J. Huang, A. Gifford, D. S. Vikram, I. A. Lim, J. A. Farrell, J. A. Bogovic, J. Hua, M. Chen, S. Jarso, S. A. Smith, S. Joel, S. Mori, J. J. Pekar, P. B. Barker, J. L. Prince, and P. C. van Zijl. Multiparametric neuroimaging reproducibility: A 3T resource study. Neuroimage, 54(4):2854–2866, 2011.
 Lee and Yin [2014] J. O. Lee and J. Yin. A necessary and sufficient condition for edge universality of Wigner matrices. Duke Mathematical Journal, 163(1):117–173, 2014.
 Lei [2016] J. Lei. A goodness-of-fit test for stochastic block models. The Annals of Statistics, 44(1):401–424, 2016.
 Leskovec and Krevl [2014] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, 2014.
 Leskovec et al. [2005] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005.
 Lovász [2012] L. Lovász. Large networks and graph limits. American Mathematical Society, 2012.
 Mukherjee et al. [2017] S. S. Mukherjee, P. Sarkar, and L. Lin. On clustering network-valued data. In Advances in Neural Information Processing Systems (NIPS), 2017.
 Ng et al. [2002] A. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2002.
 Shervashidze et al. [2011] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
 Tang et al. [2016] M. Tang, A. Athreya, D. L. Sussman, V. Lyzinski, and C. E. Priebe. A semiparametric two-sample hypothesis testing problem for random graphs. Journal of Computational and Graphical Statistics, 26(2):344–354, 2016.
 Tang et al. [2017] M. Tang, A. Athreya, D. L. Sussman, V. Lyzinski, and C. E. Priebe. A nonparametric two-sample hypothesis testing problem for random graphs. Bernoulli, 23:1599–1630, 2017.
 Tracy and Widom [1996] C. A. Tracy and H. Widom. On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics, 177:727–754, 1996.
 Yang and Leskovec [2013] J. Yang and J. Leskovec. Overlapping community detection at scale: A nonnegative matrix factorization approach. In Proceedings of the sixth ACM international conference on Web search and data mining (WSDM), pages 587–596, 2013.
 Zhang et al. [2009] B. Zhang, H. Li, R. B. Riggins, M. Zhan, J. Xuan, Z. Zhang, E. P. Hoffman, R. Clarke, and Y. Wang. Differential dependency network analysis to identify condition-specific topological changes in biological networks. Bioinformatics, 25(4):526–532, 2009.
Appendix
Here, we provide additional details such as proofs, description of tests, additional numerical results and discussions. Section A provides proofs for the theorems stated in the paper along with a corollary of Theorem 3. Section B provides detailed descriptions of all tests considered in our implementations, both existing tests as well as proposed ones. Section C provides additional numerical results, which we have referred to in the paper.
Appendix A Proofs for results
In this section, we present the proofs for Theorems 1 and 3, which provide the theoretical foundations for the proposed tests AsympNormal and AsympTW, respectively.
A.1 Proof of Theorem 1
For convenience, we assume is even. The extension to odd is straightforward. We also write instead of and define
Also let , , and .
Under the null hypothesis, that is , are centred mutually independent random variables, and hence, due to the central limit theorem, we can claim that converges to a standard normal random variable as . The rate of convergence is given by the Berry-Esseen theorem [Berry, 1941] as
where is the distribution function for . Recall our assumption that the entries are bounded away from 1. Let for some . Observe that is a product of two i.i.d. random variables, where each of them is a difference of two binomials. Hence, under , we can compute
and by using the Cauchy-Schwarz inequality,
Hence, the Berry-Esseen bound can be written as
since . We now compute the probability of type-I error in the following way:
(13) 
for any . Using the Berry-Esseen bound, we bound the first term as
where we use in the last step. Taking leads to a bound .
We now deal with the second term in (13). Observe that . Hence, we have
by the Chebyshev inequality. We can compute the variance term for any as
(14)  
In particular, under , . Using this, the Chebyshev bound is smaller than for . Hence, we obtain the claimed type-I error bound.
For the type-II error rate, we consider the stated separation condition in the form . We can bound the error probability as
For the second term, we use the Chebyshev inequality as above to show that the probability is since . For the first term, observe that we have under the separation condition, and hence for any fixed , we have for large enough . So,
One can compute similar to (14) to obtain
where the second inequality follows from the Cauchy-Schwarz inequality followed by the observation that norm is smaller than norm. Hence, the error probability is bounded as
under the assumed separation. Hence, the claim.
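For reference, the two standard inequalities invoked in the proof above can be stated in generic form as follows. The symbols $X_i$, $\sigma_i^2$, $\rho_i$, $Y$, $t$ and the absolute constant $C$ are generic notation assumed here for illustration, not the notation of the paper.

```latex
% Berry-Esseen bound for independent, not necessarily identically
% distributed, mean-zero random variables X_1, ..., X_N with
% \sigma_i^2 = E[X_i^2] and \rho_i = E|X_i|^3 < \infty:
\sup_{x \in \mathbb{R}}
\left| \mathbb{P}\!\left( \frac{\sum_{i=1}^{N} X_i}{\sqrt{\sum_{i=1}^{N} \sigma_i^2}} \le x \right) - \Phi(x) \right|
\;\le\; C \, \frac{\sum_{i=1}^{N} \rho_i}{\left( \sum_{i=1}^{N} \sigma_i^2 \right)^{3/2}} .

% Chebyshev inequality for a random variable Y with finite variance
% and any t > 0:
\mathbb{P}\big( |Y - \mathbb{E}[Y]| \ge t \big) \;\le\; \frac{\mathrm{Var}(Y)}{t^2} .
```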
A.2 Proof of Theorem 3
We first derive the asymptotic distribution under the null hypothesis. This part is similar to the proof of Lemma A.1 in Lei [2016]. Observe that under , in (9) is a symmetric random matrix, whose entries above the diagonal are independent with mean zero and variance . Now, let be a symmetric random matrix with zero diagonal, whose entries above the diagonal are i.i.d. normal with mean zero and variance . Due to the results of Erdős et al. [2012], we know that and have the same limiting distribution. Lee and Yin [2014] show that as , and hence the same conclusion holds for . The corresponding result for can be proved by considering the matrix . Based on this asymptotic result, we have
where is the upper quantile of the distribution. Since , a union bound leads to the stated conclusion under the null hypothesis.
Under the alternative hypothesis, one can see that is a rescaled version of with each entry being scaled by normalising term of (we drop the superscript for convenience). Under the stated separation condition on , it is easy to see that with high probability. So, the probability of the test statistic being smaller than is . To be precise, we decompose as , and using Weyl’s inequality, we can write
with probability at most . The second inequality follows by noting that is a mean zero matrix whose spectral norm can be bounded using the arguments stated under the null hypothesis. Hence, with probability . We set , and observe that , that is , if .
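The perturbation step above relies on Weyl's inequality. In generic form, with $A$ and $B$ denoting arbitrary real symmetric $n \times n$ matrices (assumed notation, not the paper's), it reads:

```latex
% Weyl's inequality for real symmetric matrices A and B, with eigenvalues
% ordered as \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n and \|\cdot\|
% denoting the spectral norm:
\left| \lambda_i(A + B) - \lambda_i(A) \right| \;\le\; \| B \|
\qquad \text{for all } i = 1, \dots, n,

% and in particular, for the largest eigenvalue,
\lambda_1(A + B) \;\ge\; \lambda_1(A) - \| B \| .
```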
A.3 Theorem 3 for stochastic block models
We state the following corollary, which provides an understanding of the condition on in Theorem 3 under a block model assumption.
Corollary 4.
Assume that correspond to stochastic block models with at most communities, and let . If , then
(15) 
One can observe that if is bounded by a constant and all entries of are of the same order (same as ), then the above separation condition is similar to the one stated in Theorem 1.
Proof.
The claim would follow if we show that under the stated separation, the condition on used in Theorem 3 holds. In fact, we show that in the present case, . For convenience, we simply write and define . Note that
and hence, has a block structure with at most blocks (ignoring that the diagonal is zero). Thus, there is a diagonal matrix such that has rank at most . Note that the diagonal entries of are the same as the diagonal blocks of , and so, assuming that is bounded away from 1. Hence, we can write
which is under the stated condition. For the second inequality, we use the relation between spectral and Frobenius norms of a matrix with rank . Finally, Theorem 3 leads to the result. ∎
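The relation between the spectral and Frobenius norms used in the proof is the following standard fact, written here for a generic matrix $M$ of rank at most $r$ with singular values $\sigma_1 \ge \sigma_2 \ge \dots$ (assumed notation):

```latex
% For a matrix M of rank at most r, only the first r singular values
% can be non-zero, so
\| M \|_F^2 \;=\; \sum_{i=1}^{r} \sigma_i^2 \;\le\; r \, \sigma_1^2 \;=\; r \, \| M \|^2 ,

% which gives the lower bound on the spectral norm
\| M \| \;\ge\; \frac{\| M \|_F}{\sqrt{r}} .
```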
Appendix B Detailed description of tests
In this section, we describe all the tests discussed in this paper. First, we describe the asymptotic tests, which include the tests AsympNormal and AsympTW proposed in this paper, as well as the large-sample test AsympChi2. We next describe the bootstrapped tests BootSpectral and BootFrobenius, which approximate the null distribution by randomly permuting the group assignments of the graphs. Tang et al. [2016] provide an algorithmic description of BootASE. For completeness, we include this description along with that of BootEPA, which also generates bootstrap samples based on a low rank approximation of the population adjacency. Throughout this section, the null hypothesis refers to the hypothesis that both graphs (or graph populations) have the same population adjacency.
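The permutation idea behind the bootstrapped tests can be illustrated with a minimal sketch. It assumes that each graph is given as a dense adjacency matrix (a list of lists), and it uses the Frobenius norm of the difference between the group-mean adjacency matrices as the statistic. The function names, the add-one correction in the p-value and the default number of permutations are illustrative assumptions, not the exact procedures of BootSpectral or BootFrobenius.

```python
import random


def frobenius_distance(group_a, group_b):
    """Frobenius norm of the difference between the mean adjacency matrices.

    Assumed statistic for illustration; each group is a non-empty list of
    n x n adjacency matrices given as lists of lists.
    """
    n = len(group_a[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            mean_a = sum(g[i][j] for g in group_a) / len(group_a)
            mean_b = sum(g[i][j] for g in group_b) / len(group_b)
            total += (mean_a - mean_b) ** 2
    return total ** 0.5


def permutation_p_value(group_a, group_b, statistic=frobenius_distance,
                        num_permutations=200, seed=0):
    """Approximate the null distribution by permuting group assignments."""
    rng = random.Random(seed)
    observed = statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    m = len(group_a)
    exceed = 0
    for _ in range(num_permutations):
        # Randomly reassign the pooled graphs to two groups of the original sizes.
        rng.shuffle(pooled)
        if statistic(pooled[:m], pooled[m:]) >= observed:
            exceed += 1
    # Add-one correction (an assumption here) keeps the estimate strictly positive.
    return (exceed + 1) / (num_permutations + 1)
```

The null hypothesis would then be rejected when the returned value falls below the chosen significance level. Swapping in a spectral-norm statistic only requires passing a different `statistic` function.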
B.1 Asymptotic tests
We first describe the AsympNormal test below. In addition to accepting or rejecting the null hypothesis, we also show how to compute the p-value, which is the probability, computed under the null hypothesis, of observing a test statistic at least as extreme as the one actually observed. The p-value is often useful for quantifying the dissimilarity between two populations. We use the standard rule of rejecting the null hypothesis when the p-value is less than the prescribed significance level . Note that in AsympNormal, the p-value involves a factor of 2 to account for both the upper and the lower tail probabilities.
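The two-sided p-value computation and the rejection rule can be sketched as follows, assuming that the test statistic has already been computed and is approximately standard normal under the null hypothesis. The function names and the default significance level of 0.05 are illustrative assumptions.

```python
import math


def two_sided_normal_p_value(t):
    """Two-sided p-value of a statistic t under a standard normal null.

    Uses P(|Z| >= |t|) = 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)),
    where the factor of 2 accounts for the upper and the lower tail.
    """
    return math.erfc(abs(t) / math.sqrt(2.0))


def reject_null(t, alpha=0.05):
    """Reject the null hypothesis when the p-value is below alpha."""
    return two_sided_normal_p_value(t) < alpha
```

For instance, `reject_null(3.0)` rejects at the 5% level, whereas `reject_null(0.5)` does not.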
The AsympChi2 test is listed below. For convenience, we write , where and denote the numerator and denominator of each term in the summation (2). This notation corresponds to the fact that is the sample mean difference for entry , and is an estimate of the variance of . We note that for sparse graphs and small , the summation in (2) may have terms of the form . Hence, we sum only over the set of edges in defined below.