GraSPy: Graph Statistics in Python
We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a scikit-learn compliant API. GraSPy can be downloaded from the Python Package Index (PyPI) and is released under the Apache 2.0 open-source license. The documentation and all releases are available at https://neurodata.io/graspy.
Graphs, or networks, are a mathematical representation of data that consists of discrete objects (nodes or vertices) and relationships between these objects (edges). For example, if regions of a human brain are treated as vertices, the edges can represent how strongly each pair of regions is connected. Since graphs necessarily deal with relationships between nodes, many of the classical statistical assumptions about independence are violated. Thus, specific statistical methodology is required for performing robust statistical inference on graphs and populations of graphs. GraSPy fills this gap by providing implementations of algorithms with strong statistical guarantees, such as graph and multi-graph embedding methods, two-graph hypothesis testing, and clustering of vertices of graphs. Many of the algorithms implemented in GraSPy are flexible and can operate on graphs that are weighted or unweighted, as well as directed or undirected.
2 Library Overview
Figure 1 summarizes the submodules available in GraSPy. The library contains functionality for fitting and sampling from random graph models, performing dimensionality reduction on graphs or populations of graphs (embedding), testing hypotheses on graphs, and plotting graphs and embeddings.
The following provides a brief overview of the submodules of GraSPy. A more detailed overview and code usage can be found in the tutorial section of the GraSPy documentation at https://graspy.neurodata.io/tutorial.
Simulations (Figure 1a) Three classes of random graph models are implemented in GraSPy: 1) the Erdős-Rényi (ER) model, 2) the stochastic block model (SBM), and 3) the random dot product graph (RDPG) model. The ER model is the simplest: it is parameterized by the number of vertices and by either the probability that an edge exists between any pair of vertices or the exact number of edges in the graph. Under the ER model, every pair of nodes has the same probability of being connected. Unlike the ER model, the SBM produces graphs containing communities, where vertices in each community share common probabilities of connection to every other community. The SBM is parameterized by the number of communities and a probability matrix that specifies the probability of edges within and between communities. An extension of the SBM, the degree-corrected SBM (DCSBM), has an added parameter associated with each node that denotes its promiscuity in the graph, that is, its relative degree among the other nodes in its community. Nodes still share the same relative probabilities of connection to each community, but the nodes within a community may have heterogeneous overall degrees. The RDPG model assumes that each vertex in the graph is associated with a latent vector in a low-dimensional Euclidean space. The probability of an edge existing between a pair of vertices is determined by the dot product of their latent position vectors. The RDPG is parameterized by a matrix of these latent positions, with one row per vertex. GraSPy provides implementations for sampling from each of these graph models given these parameters, as well as for estimating the parameters of a model from a given graph. GraSPy also allows for weighting functions and directed graphs when sampling from these models.
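To make the three parameterizations concrete, the following is a minimal NumPy sketch of how each model reduces to a matrix of edge probabilities from which an undirected, unweighted graph is drawn. It is an illustration only and does not use GraSPy's own functions; the helper names (sample_undirected, er_probabilities, sbm_probabilities, rdpg_probabilities) and all parameter values are assumptions made for this example.

```python
import numpy as np


def sample_undirected(P, rng):
    """Draw a symmetric 0/1 adjacency matrix without self-loops from edge probabilities P."""
    n = P.shape[0]
    # Independent Bernoulli draws for each pair above the diagonal.
    upper = np.triu(rng.random((n, n)) < P, k=1)
    # Mirror the upper triangle so the graph is undirected.
    return (upper | upper.T).astype(int)


def er_probabilities(n, p):
    """ER model: every pair of vertices shares the same edge probability p."""
    return np.full((n, n), p)


def sbm_probabilities(block_sizes, B):
    """SBM: the edge probability depends only on the communities of the two vertices."""
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return np.asarray(B)[np.ix_(labels, labels)]


def rdpg_probabilities(X):
    """RDPG: the edge probability is the dot product of the two latent positions."""
    return np.clip(X @ X.T, 0.0, 1.0)


rng = np.random.default_rng(0)

A_er = sample_undirected(er_probabilities(50, 0.3), rng)

B = [[0.8, 0.1], [0.1, 0.6]]  # hypothetical block probability matrix
A_sbm = sample_undirected(sbm_probabilities([30, 20], B), rng)

X = rng.uniform(0.1, 0.7, size=(50, 2))  # hypothetical latent positions, one row per vertex
A_rdpg = sample_undirected(rdpg_probabilities(X), rng)
```

In this framing the ER model is an SBM with a single community, and an SBM whose block probability matrix is positive semidefinite is an RDPG in which all vertices of a community share one latent position.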
Preprocessing (Figure 1b) Various utility functions help the user input real data into GraSPy or check simple attributes about a graph. Some examples include finding the largest connected component of a graph, finding the intersection or union of connected components across multiple graphs, transforming the weights of a graph, or checking whether a graph is directed. These functions speed the user’s workflow when working with real data that may be messy or noisy before preprocessing.
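As an illustration of one such utility, the sketch below finds the largest connected component of a graph given as an adjacency matrix, using a breadth-first search. It is a simplified stand-in written for this example rather than GraSPy's implementation; the function name largest_connected_component is an assumption, and edge direction and weights are ignored when deciding connectivity.

```python
from collections import deque

import numpy as np


def largest_connected_component(A):
    """Return the vertex indices of the largest connected component and its induced subgraph."""
    A = np.asarray(A)
    # Treat any nonzero entry in either direction as an undirected edge.
    undirected = (A != 0) | (A.T != 0)
    n = undirected.shape[0]
    seen = np.zeros(n, dtype=bool)
    best = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search from an unvisited vertex collects one component.
        seen[start] = True
        component, queue = [], deque([start])
        while queue:
            v = queue.popleft()
            component.append(v)
            for u in np.flatnonzero(undirected[v] & ~seen):
                seen[u] = True
                queue.append(u)
        if len(component) > len(best):
            best = component
    keep = np.sort(np.array(best))
    return keep, A[np.ix_(keep, keep)]


# Toy graph with two components: {0, 1, 2} and {3, 4}.
A = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (1, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1

keep, A_lcc = largest_connected_component(A)
is_directed = not np.array_equal(A, A.T)  # a simple symmetry check
```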
Embedding (Figure 1c) Inference on random graphs depends on a low-dimensional Euclidean representation of the vertices of graphs, known as latent positions, typically given by spectral decompositions of adjacency or Laplacian matrices. Adjacency spectral embedding (ASE) and Laplacian spectral embedding (LSE) are methods for embedding a single graph, and omnibus embedding allows for embedding multiple graphs into the same dimensions such that the embeddings can be meaningfully compared. In addition, the number of embedding dimensions can be chosen automatically by the profile-likelihood algorithm of Zhu and Ghodsi.
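The idea behind adjacency spectral embedding can be sketched in a few lines of NumPy: keep the leading eigenpairs of the adjacency matrix and scale each eigenvector by the square root of the magnitude of its eigenvalue. The code below is a minimal sketch for a symmetric (undirected) graph with a fixed number of dimensions, not GraSPy's implementation; the function name adjacency_spectral_embed and the two-block example graph are assumptions made for illustration.

```python
import numpy as np


def adjacency_spectral_embed(A, d):
    """Estimate d-dimensional latent positions from a symmetric adjacency matrix."""
    eigvals, eigvecs = np.linalg.eigh(A)
    # Keep the d eigenpairs with the largest eigenvalues in magnitude.
    top = np.argsort(np.abs(eigvals))[::-1][:d]
    # Scale each retained eigenvector by the square root of |eigenvalue|.
    return eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))


# Hypothetical two-community graph with 100 vertices per community.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
B = np.array([[0.5, 0.1], [0.1, 0.5]])
P = B[np.ix_(labels, labels)]
upper = np.triu(rng.random(P.shape) < P, k=1)
A = (upper | upper.T).astype(float)

X_hat = adjacency_spectral_embed(A, d=2)

# When the retained eigenvalues are positive, dot products of the estimated
# latent positions approximate the edge probabilities.
P_hat = X_hat @ X_hat.T
within = P_hat[:100, :100].mean()
between = P_hat[:100, 100:].mean()
```

Here the average of the reconstructed probabilities within a community should land near 0.5 and the average between communities near 0.1, which is the sense in which the embedding recovers the generating model up to an orthogonal transformation of the latent positions.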
Hypothesis Testing (Figure 1d) Given two graphs, a natural question to ask is whether these graphs are both random samples from the same generative distribution. GraSPy provides two types of tests for this null hypothesis: semiparametric and nonparametric. Both tests are framed under the RDPG model, where the generative distribution can be modeled as a set of latent positions. The semiparametric test can only be performed on two graphs of the same size and with known correspondence between the vertices of the two graphs. Nonparametric testing can be performed on graphs without vertex alignment, or even with different numbers of vertices. Both tests provide a statistically principled way of assessing whether two observed graphs were drawn from the same distribution; for example, one can test whether the brain connectivity graphs of siblings or twins came from the same generative distribution (Chung et al., in preparation).
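Because latent positions under the RDPG model are only identifiable up to an orthogonal transformation, comparing two sets of estimated positions with matched vertices requires aligning them first. The sketch below computes one such comparison, the Frobenius distance after the best orthogonal alignment (an orthogonal Procrustes problem solved with an SVD). It illustrates only the kind of statistic a semiparametric comparison can be built on; the function name procrustes_statistic is an assumption, and the resampling step needed to turn the statistic into a p-value is omitted.

```python
import numpy as np


def procrustes_statistic(X, Y):
    """Frobenius distance between X and Y after the best orthogonal alignment of X onto Y."""
    # The minimizer of ||X W - Y||_F over orthogonal W is W = U V^T,
    # where U S V^T is the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt
    return np.linalg.norm(X @ W - Y, ord="fro")


rng = np.random.default_rng(0)
X = rng.uniform(0.1, 0.7, size=(50, 2))  # hypothetical latent positions

# A rotated copy of X represents the same latent positions.
theta = 0.8
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
same = procrustes_statistic(X, X @ R)

# Independently drawn positions represent a different graph distribution.
different = procrustes_statistic(X, rng.uniform(0.1, 0.7, size=(50, 2)))
```

The statistic is numerically zero for the rotated copy and clearly positive for the independent draw. This alignment relies on the rows of the two matrices referring to the same vertices, which is why a known vertex correspondence is required.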
Clustering (Figure 1e) GraSPy uses Gaussian mixture models (GMM) and k-means to compute the grouping structure of vertices after embedding. The number of clusters to fit for GMM is chosen by the Bayesian information criterion (BIC), a penalized likelihood function used to evaluate the quality of estimators. Similarly, the silhouette score is used to choose the number of clusters for k-means. Both functions sweep over a range of parameters and use the above metrics to choose clustering parameters in an unsupervised manner.
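The BIC sweep described above can be sketched directly with scikit-learn's GaussianMixture, which GraSPy's API is modeled after. This is an illustrative sketch rather than GraSPy's clustering class: the synthetic three-cluster data, the candidate range of 1 to 5 components, and the variable names are all assumptions. Note that scikit-learn defines BIC so that lower values are better.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for embedded vertices: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# Fit one mixture per candidate number of components and record its BIC.
bic_by_k = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bic_by_k[k] = gmm.bic(X)

# Select the number of components with the lowest BIC.
best_k = min(bic_by_k, key=bic_by_k.get)
labels = GaussianMixture(n_components=best_k, n_init=3, random_state=0).fit_predict(X)
```

The same sweep structure applies to k-means by replacing the mixture model with a k-means fit and the BIC with the silhouette score, where higher values are better and at least two clusters are required.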
Plotting (Figure 1f) GraSPy extends seaborn to visualize graphs as adjacency matrices and embedded graphs as paired scatter plots. Individual graphs can be visualized using the heatmap function, and multiple graphs can be overlaid on top of each other using the gridplot function. Both adjacency matrix visualizations can be sorted by various node metadata. The pairplot function can visualize high-dimensional data, such as graphs in the embedded space, as a pairwise scatter plot.
GraSPy is the first open-source Python package to perform robust statistical analysis on graphs and graph populations. Its compliance with the scikit-learn API makes it an easy-to-use tool for anyone familiar with machine learning in Python. In addition, GraSPy is implemented with an extensible class structure, making it easy to modify and add new algorithms to the package. As GraSPy continues to grow and add functionality, we believe it will accelerate statistically valid discovery in any field of study concerned with populations of graphs.
References and Notes
-  A. Athreya, D. E. Fishkind, M. Tang, C. E. Priebe, Y. Park, J. T. Vogelstein, K. Levin, V. Lyzinski, Y. Qin, and D. L. Sussman, “Statistical inference on random dot product graphs: a survey,” Journal of Machine Learning Research, vol. 18, no. 226, pp. 1–92, 2018. [Online]. Available: http://jmlr.org/papers/v18/17-448.html
-  S. J. Young and E. R. Scheinerman, “Random dot product graph models for social networks,” in International Workshop on Algorithms and Models for the Web-Graph. Springer, 2007, pp. 138–149.
-  K. Eichler, F. Li, A. Litwin-Kumar, Y. Park, I. Andrade, C. M. Schneider-Mizell, T. Saumweber, A. Huser, C. Eschbach, B. Gerber et al., “The complete connectome of a learning and memory centre in an insect brain,” Nature, vol. 548, no. 7666, p. 175, 2017.
-  K. Levin, A. Athreya, M. Tang, V. Lyzinski, and C. E. Priebe, “A central limit theorem for an omnibus embedding of multiple random dot product graphs,” pp. 964–967, 2017.
-  M. Zhu and A. Ghodsi, “Automatic dimensionality selection from the scree plot via the use of profile likelihood,” Computational Statistics & Data Analysis, vol. 51, no. 2, pp. 918–930, 2006.
-  M. Tang, A. Athreya, D. L. Sussman, V. Lyzinski, Y. Park, and C. E. Priebe, “A semiparametric two-sample hypothesis testing problem for random graphs,” Journal of Computational and Graphical Statistics, vol. 26, no. 2, pp. 344–354, 2017.
-  M. Tang, A. Athreya, D. L. Sussman, V. Lyzinski, and C. E. Priebe, “A nonparametric two-sample hypothesis testing problem for random dot product graphs,” Journal of Computational and Graphical Statistics, Sep. 2014.
-  M. Waskom, O. Botvinnik, D. O’Kane, P. Hobson, J. Ostblom, S. Lukauskas, D. C. Gemperline, T. Augspurger, Y. Halchenko, J. B. Cole, J. Warmenhoven, J. de Ruiter, C. Pye, S. Hoyer, J. Vanderplas, S. Villalba, G. Kunter, E. Quintero, P. Bachant, M. Martin, K. Meyer, A. Miles, Y. Ram, T. Brunner, T. Yarkoni, M. L. Williams, C. Evans, C. Fitzgerald, Brian, and A. Qalieh, “mwaskom/seaborn: v0.9.0 (july 2018),” Jul. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1313201