On the Consistency of the Likelihood Maximization Vertex Nomination Scheme: Bridging the Gap Between Maximum Likelihood Estimation and Graph Matching

Vince Lyzinski, Keith Levin, Donniell E. Fishkind, Carey E. Priebe
Human Language Technology Center of Excellence, Johns Hopkins University
Department of Computer Science, Johns Hopkins University
Department of Applied Mathematics and Statistics, Johns Hopkins University
Abstract

Given a graph in which a few vertices are deemed interesting a priori, the vertex nomination task is to order the remaining vertices into a nomination list such that there is a concentration of interesting vertices at the top of the list. Previous work has yielded several approaches to this problem, with theoretical results in the setting where the graph is drawn from a stochastic block model (SBM), including a vertex nomination analogue of the Bayes optimal classifier. In this paper, we prove that maximum likelihood (ML)-based vertex nomination is consistent, in the sense that the performance of the ML-based scheme asymptotically matches that of the Bayes optimal scheme. We prove theorems of this form both in the setting where the model parameters are known and in the setting where they are unknown. Additionally, we introduce and prove consistency of a related, more scalable restricted-focus ML vertex nomination scheme. Finally, we incorporate vertex and edge features into ML-based vertex nomination and briefly explore the empirical effectiveness of this approach.

1 Introduction and Background

Graphs are a common data modality, useful for modeling complex relationships between objects, with applications spanning fields as varied as biology (Jeong et al., 2001; Bullmore and Sporns, 2009), sociology (Wasserman and Faust, 1994), and computer vision (Foggia et al., 2014; Kandel et al., 2007), to name a few. For example, in neuroscience, vertices may be neurons and edges adjoin pairs of neurons that share a synapse (Bullmore and Sporns, 2009); in social networks, vertices may correspond to people and edges to friendships between them (Carrington et al., 2005; Yang and Leskovec, 2015); in computer vision, vertices may represent pixels in an image and edges may represent spatial proximity or multi-resolution mappings (Kandel et al., 2007). In many useful networks, vertices with similar attributes form densely-connected communities compared to vertices with highly disparate attributes, and uncovering these communities is an important step in understanding the structure of the network. There is an extensive literature devoted to uncovering this community structure in network data, including methods based on maximum modularity (Newman and Girvan, 2004; Newman, 2006b), spectral partitioning algorithms (Luxburg, 2007; Rohe et al., 2011; Sussman et al., 2012; Lyzinski et al., 2014b), and likelihood-based methods (Bickel and Chen, 2009), among others.

In the setting of vertex nomination, one community in the network is of particular interest, and the inference task is to order the vertices into a nomination list with those vertices from the community of interest concentrating at the top of the list. See Marchette et al. (2011); Coppersmith and Priebe (2012); Coppersmith (2014); Fishkind et al. (2015) and the references contained therein for a review of the relevant vertex nomination literature. Vertex nomination is a semi-supervised inference task, with example vertices from the community of interest—and, ideally, also examples not from the community of interest—being leveraged in order to create a nomination list. In this way, the vertex nomination problem is similar to the problem faced by personalized recommender systems (see, for example, Resnick and Varian, 1997; Ricci et al., 2011), where, given a training list of objects of interest, the goal is to arrange the remaining objects into a recommendation list with “interesting” objects concentrated at the top of the list. The main difference between the two inference tasks is that in vertex nomination the features of the data are encoded into the topology of a network, rather than being observed directly as features (though see Section 5 for the case where vertices are annotated with additional information in the form of features).

In this paper, we develop the notion of a consistent vertex nomination scheme (Definition 2). We then proceed to prove that the maximum likelihood vertex nomination scheme of Fishkind et al. (2015) is consistent under mild model assumptions on the underlying stochastic block model (Theorem 6). In the process, we propose a new, efficiently and exactly solvable likelihood-based vertex nomination scheme, the restricted-focus maximum likelihood vertex nomination scheme, and prove the analogous consistency result (Theorem 8). In addition, under mild model assumptions, we prove that both schemes maintain their consistency when the stochastic block model parameters are unknown and are estimated using the seed vertices (Theorems 9 and 10). In both cases, we show that consistency is possible even when the seeds are an asymptotically vanishing portion of the graph. Lastly, we show how both schemes can be easily modified to incorporate edge weights and vertex features (Section 5), before demonstrating the practical effect of our theoretical results on real and synthetic data (Section 6) and closing with a brief discussion (Section 7).

Notation: We say that a sequence of random variables $\{X_n\}_{n=1}^{\infty}$ converges almost surely to a random variable $X$, written $X_n \to X$ a.s., if $\mathbb{P}[\lim_{n \to \infty} X_n = X] = 1$. We say a sequence of events $\{A_n\}_{n=1}^{\infty}$ occurs almost always almost surely (abbreviated a.a.a.s.) if, with probability 1, $A_n^{c}$ occurs for at most finitely many $n$. By the Borel-Cantelli lemma, $\sum_{n} \mathbb{P}[A_n^{c}] < \infty$ implies $A_n$ occurs a.a.a.s. We write $\mathcal{G}_n$ to denote the set of all (possibly weighted) graphs on $n$ vertices. Throughout, without loss of generality, we will assume that the vertex set is given by $V = \{1, 2, \dots, n\}$. For a positive integer $K$, we will often use $[K]$ to denote the set $\{1, 2, \dots, K\}$. For a set $S$, we will use $\binom{S}{2}$ to denote the set of all pairs of distinct elements of $S$. That is, $\binom{S}{2} = \{\{u, v\} : u, v \in S,\, u \neq v\}$. For a function $f$ with domain $D$ and a set $S \subseteq D$, we write $f|_S$ to denote the restriction of $f$ to the set $S$.

1.1 Background

Stochastic block model random graphs offer a theoretically tractable model for graphs with latent community structure (Rohe et al., 2011; Sussman et al., 2012; Bickel and Chen, 2009), and have been widely used in the literature to model community structure in real networks (Airoldi et al., 2008; Karrer and Newman, 2011). While stochastic block models can be too simplistic to capture the eccentricities of many real graphs, they have proven to be a useful, tractable surrogate for more complicated networks (Airoldi et al., 2013; Olhede and Wolfe, 2014).

Definition 1.

Let $K$ and $n$ be positive integers and let $\vec{n} = (n_1, n_2, \dots, n_K)$ be a vector of positive integers with $\sum_{k=1}^{K} n_k = n$. Let $b : [n] \to [K]$ and let $\Lambda \in [0,1]^{K \times K}$ be symmetric. A $\mathcal{G}_n$-valued random graph $G$ is an instantiation of a conditional Stochastic Block Model, written $G \sim \mathrm{SBM}(K, \vec{n}, b, \Lambda)$, if

  • The vertex set $V$ is partitioned into $K$ blocks, $V_1, V_2, \dots, V_K$, of cardinalities $|V_k| = n_k$ for $k \in [K]$;

  • The block membership function $b : [n] \to [K]$ is such that for each $v \in V_k$, $b(v) = k$;

  • The symmetric block communication matrix $\Lambda \in [0,1]^{K \times K}$ is such that for each $\{u, v\} \in \binom{V}{2}$, there is an edge between vertices $u$ and $v$ with probability $\Lambda_{b(u), b(v)}$, independently of all other edges.

Without loss of generality, let $V_1$ be the block of interest for vertex nomination. For each $k \in [K]$, we further decompose $V_k$ into $V_k = S_k \cup U_k$ (with $S_k \cap U_k = \emptyset$), where the vertices in $S_k$ have their block membership observed a priori. We call the vertices in $S = \cup_{k} S_k$ seed vertices, and let $m_k = |S_k|$ and $m = |S| = \sum_k m_k$. We will denote the set of nonseed vertices by $U = V \setminus S = \cup_k U_k$, and for all $k \in [K]$, let $u_k = |U_k| = n_k - m_k$ and $u = |U| = n - m$. Throughout this paper, we assume that the seed vertices are chosen uniformly at random from all possible subsets of $V$ of size $m$. The task in vertex nomination is to leverage the information contained in the seed vertices to produce a nomination list $\mathcal{X}$ (i.e., an ordering of the vertices in $U$) such that the vertices in $U_1$ concentrate at the top of the list. We note that, strictly speaking, a nomination list is also a function of the observed graph $G$, a fact that we suppress for ease of notation. We measure the efficacy of a nomination scheme via average precision

\[ \mathrm{AP}(\mathcal{X}) = \frac{1}{n_1 - m_1} \sum_{i=1}^{n_1 - m_1} \frac{1}{i} \sum_{j=1}^{i} \mathbb{1}\{\mathcal{X}(j) \in U_1\}. \tag{1} \]

$\mathrm{AP}(\mathcal{X})$ ranges from $0$ to $1$, with a higher value indicating a more effective nomination scheme: indeed, $\mathrm{AP}(\mathcal{X}) = 1$ indicates that the first $n_1 - m_1$ vertices in the nomination list are all from the block of interest, and $\mathrm{AP}(\mathcal{X}) = 0$ indicates that none of the top $n_1 - m_1$ ranked vertices are from the block of interest. Letting $H_k = \sum_{i=1}^{k} 1/i$ denote the $k$-th harmonic number, with the convention that $H_0 = 0$, we can rearrange (1) as
\[ \mathrm{AP}(\mathcal{X}) = \sum_{j=1}^{n_1 - m_1} \frac{H_{n_1 - m_1} - H_{j-1}}{n_1 - m_1}\, \mathbb{1}\{\mathcal{X}(j) \in U_1\}, \]
from which we see that the average precision is simply a convex combination of the indicators of correctness in the rank list, in which correctly placing an interesting vertex higher in the nomination list (i.e., with rank close to 1) is rewarded more than correctly placing an interesting vertex lower in the nomination list.
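To make the average precision concrete, here is a minimal Python sketch (our own illustration; the function name and interface are hypothetical, not from Fishkind et al. (2015)) computing AP as in Eq. (1):

```python
import numpy as np

def average_precision(nomination_list, interesting, n1_minus_m1):
    """Average precision of a nomination list, per Eq. (1).

    nomination_list : nonseed vertices, best candidates first.
    interesting     : set of vertices truly in the block of interest.
    n1_minus_m1     : number of nonseed vertices in the block of interest."""
    hits, total = 0, 0.0
    for i, v in enumerate(nomination_list[:n1_minus_m1], start=1):
        hits += v in interesting
        total += hits / i          # precision at rank i
    return total / n1_minus_m1

# A perfect list achieves AP = 1; a list whose top entries are all
# uninteresting achieves AP = 0.
assert average_precision([1, 2, 3, 4], {1, 2}, 2) == 1.0
assert average_precision([3, 4, 1, 2], {1, 2}, 2) == 0.0
```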

In Fishkind et al. (2015), three vertex nomination schemes are presented in the context of stochastic block model random graphs: the canonical vertex nomination scheme, which is suitable for small graphs (tens of vertices); the likelihood maximization vertex nomination scheme, which is suitable for small to medium graphs (up to thousands of vertices); and the spectral partitioning vertex nomination scheme, which is suitable for medium to very large graphs (up to tens of millions of vertices). In the stochastic block model setting, the canonical vertex nomination scheme is provably optimal: under mild model assumptions, the expected average precision of the canonical scheme is at least that of any other vertex nomination scheme (Fishkind et al., 2015), where the expectation is with respect to a $\mathrm{SBM}(K, \vec{n}, b, \Lambda)$-valued random graph and the selection of the seed vertices. Thus, the canonical method is the vertex nomination analogue of the Bayes classifier, and this motivates the following definition:

Definition 2.

Let $G \sim \mathrm{SBM}(K, \vec{n}, b, \Lambda)$. With notation as above, a vertex nomination scheme $\mathcal{X}$ is consistent if
\[ \lim_{n \to \infty} \Big( \mathbb{E}\big[\mathrm{AP}(\mathcal{X}^{C})\big] - \mathbb{E}\big[\mathrm{AP}(\mathcal{X})\big] \Big) = 0, \]
where $\mathcal{X}^{C}$ denotes the canonical vertex nomination scheme.

In our proofs below, where we establish the consistency of two nomination schemes, we prove a stronger fact, namely that $\mathrm{AP} = 1$ a.a.a.s. We prefer the definition of consistency given in Definition 2 since it allows us to speak about the best possible nomination scheme even when the model is such that $\lim_{n \to \infty} \mathbb{E}[\mathrm{AP}(\mathcal{X}^{C})] < 1$.

In Fishkind et al. (2015), it was proven that under mild assumptions on the stochastic block model underlying $G$, we have
\[ \lim_{n \to \infty} \mathbb{E}\big[\mathrm{AP}(\mathcal{X}^{C})\big] = 1, \]
from which the consistency of the canonical scheme follows immediately. The spectral nomination scheme proceeds by first $K$-means clustering the adjacency spectral embedding (Sussman et al., 2012) of $G$, and then nominating vertices based on their distance to the cluster of interest. Consistency of the spectral scheme is an immediate consequence of the fact that, under mild model assumptions on the underlying stochastic block model, $K$-means clustering of the adjacency spectral embedding of $G$ perfectly clusters the vertices of $G$ a.a.a.s. (Lyzinski et al., 2014b).
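For intuition, the following is a rough Python sketch of a spectral nomination pipeline in the spirit just described; the embedding dimension, eigenvalue scaling, and distance-based ranking here are our own illustrative choices, not the exact algorithm analyzed in Fishkind et al. (2015) or Lyzinski et al. (2014b):

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_nominate(A, seeds_of_interest, d, K):
    """Toy adjacency-spectral-embedding nomination.

    A : (n, n) symmetric adjacency matrix (dense float array).
    seeds_of_interest : indices of seeds known to lie in the block of interest.
    d : embedding dimension; K : number of blocks."""
    vals, vecs = eigsh(A.astype(float), k=d, which='LA')
    X = vecs * np.sqrt(np.abs(vals))            # adjacency spectral embedding
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    # take the cluster containing the most seeds of interest ...
    target = np.bincount(labels[seeds_of_interest], minlength=K).argmax()
    # ... and rank the remaining vertices by distance to its centroid
    centroid = X[labels == target].mean(axis=0)
    dist = np.linalg.norm(X - centroid, axis=1)
    seed_set = set(seeds_of_interest)
    return [v for v in np.argsort(dist) if v not in seed_set]
```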

Bickel and Chen (2009) proved that maximum likelihood estimation provides consistent estimates of the model parameters in a more common variant of the conditional stochastic block model of Definition 1, namely, in the stochastic block model with random block assignments:

Definition 3.

Let $K$, $n$, and $\Lambda$ be as above. Let $\vec{\pi} \in [0,1]^{K}$ be a probability vector over $K$ outcomes and let $\tau : [n] \to [K]$ be a random function. A $\mathcal{G}_n$-valued random graph $G$ is an instantiation of a Stochastic Block Model with random block assignments, written $G \sim \mathrm{SBM}(K, \vec{\pi}, \tau, \Lambda)$, if

  • For each vertex $v \in V$ and block $k \in [K]$, independently of all other vertices, the block assignment function $\tau$ assigns $v$ to block $k$ with probability $\pi_k$ (i.e., $\mathbb{P}[\tau(v) = k] = \pi_k$);

  • The symmetric block communication matrix $\Lambda \in [0,1]^{K \times K}$ is such that, conditioned on $\tau$, for each $\{u, v\} \in \binom{V}{2}$ there is an edge between vertices $u$ and $v$ with probability $\Lambda_{\tau(u), \tau(v)}$, independently of all other edges.

A consequence of the result of Bickel and Chen (2009) is that the maximum likelihood estimate $\hat{\tau}$ of the block assignment function $\tau$ perfectly clusters the vertices a.a.a.s. in this setting. This bears noting, as our maximum likelihood vertex nomination schemes (defined below in Section 2) proceed by first constructing a maximum likelihood estimate of the block membership function $b$, then ranking vertices based on a measure of model misspecification. Extending the results from Bickel and Chen (2009) to our present framework—where we consider $\Lambda$ and $\vec{n}$ to be known (or errorfully estimated via seeded vertices), as opposed to parameters to be optimized over in the likelihood function as done in Bickel and Chen (2009)—is not immediate.

We note the recent result by Newman (2016), which shows the equivalence of maximum-likelihood and maximum-modularity methods in a special case of the stochastic block model in which the block communication parameters are known. Our results, along with this recent result, immediately imply a consistent maximum modularity-based vertex nomination scheme under that special-case model.

2 Graph Matching and Maximum Likelihood Estimation

Consider $G \sim \mathrm{SBM}(K, \vec{n}, b, \Lambda)$ with associated adjacency matrix $A$, and, as above, denote the set of seed vertices by $S$. Define the set of feasible block assignment functions
\[ \mathcal{B} = \big\{ \phi : V \to [K] \text{ such that } |\phi^{-1}(k)| = n_k \text{ for all } k \in [K], \text{ and } \phi|_S = b|_S \big\}. \]
Writing $\ell_{\phi}(u,v) := A_{u,v} \log \Lambda_{\phi(u), \phi(v)} + (1 - A_{u,v}) \log\big(1 - \Lambda_{\phi(u), \phi(v)}\big)$, the maximum likelihood estimator of $b$ is any member of the set of functions

\[ \hat{\mathcal{B}} = \operatorname*{arg\,max}_{\phi \in \mathcal{B}} \sum_{\{u,v\} \in \binom{V}{2}} \ell_{\phi}(u,v) = \operatorname*{arg\,max}_{\phi \in \mathcal{B}} \Bigg( \sum_{\{u,v\} \in \binom{U}{2}} \ell_{\phi}(u,v) + \sum_{u \in S,\, v \in U} \ell_{\phi}(u,v) \Bigg), \tag{2} \]

where the second equality follows from independence of the edges and from splitting the edges in the sum according to whether or not they are incident to a seed vertex (terms involving only seed vertices are constant over $\mathcal{B}$ and may be dropped). We can reformulate (2) as a graph matching problem by identifying each $\phi \in \mathcal{B}$ with an $n \times n$ permutation matrix:

Definition 4.

Let $G_1$ and $G_2$ be two $n$-vertex graphs with respective adjacency matrices $A$ and $B$. The Graph Matching Problem for aligning $G_1$ and $G_2$ is
\[ \min_{P \in \Pi(n)} \big\| A - P B P^{T} \big\|_F, \]
where $\Pi(n)$ is defined to be the set of all $n \times n$ permutation matrices.

Incorporating seed vertices (i.e., vertices whose correspondence across $G_1$ and $G_2$ is known a priori) into the graph matching problem is immediate (Fishkind et al., 2012). Letting the seed vertices be (without loss of generality) $\{1, 2, \dots, m\}$ in both graphs, the seeded graph matching (SGM) problem is

\[ \min_{P \in \Pi(n, m)} \big\| A - P B P^{T} \big\|_F, \tag{3} \]

where
\[ \Pi(n, m) = \big\{ P \in \Pi(n) : P = I_m \oplus P' \text{ for some } P' \in \Pi(n - m) \big\} \]
is the set of permutation matrices fixing the seed vertices. Setting $B$ to be the log-odds matrix

\[ B_{u,v} = \log\!\left( \frac{\Lambda_{b(u), b(v)}}{1 - \Lambda_{b(u), b(v)}} \right) \ \text{ for all } \{u, v\} \in \binom{V}{2}, \qquad B_{v,v} = 0, \tag{4} \]

observe that the optimization problem in Equation (2) is equivalent to that in (3) if we view $B$ as encoding a weighted graph. Hence, we can apply known graph matching algorithms to approximately find $\hat{b} \in \hat{\mathcal{B}}$.
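As a concrete illustration of this construction (a sketch under the notation above; the helper names are ours), the log-odds matrix of Eq. (4) and the trace form of the matching objective can be computed as follows:

```python
import numpy as np

def log_odds_matrix(Lambda, block):
    """Weighted 'model graph' B of Eq. (4):
    B[u, v] = log-odds of an edge between blocks block[u] and block[v].

    Lambda : (K, K) symmetric matrix with entries in (0, 1).
    block  : length-n integer array of block labels in {0, ..., K-1}."""
    L = Lambda[np.ix_(block, block)]
    B = np.log(L / (1.0 - L))
    np.fill_diagonal(B, 0.0)       # hollow: no self-loops
    return B

def gm_objective(A, B, P):
    """trace(A P B P^T), the quantity the (seeded) graph matching
    problem maximizes when minimizing ||A - P B P^T||_F."""
    return np.trace(A @ P @ B @ P.T)
```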

Decomposing $A$ and $B$ as
\[ A = \begin{pmatrix} A_{11} & A_{21}^{T} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{21}^{T} \\ B_{21} & B_{22} \end{pmatrix}, \]
where $A_{11}, B_{11} \in \mathbb{R}^{m \times m}$ correspond to the seed vertices and $A_{22}, B_{22} \in \mathbb{R}^{u \times u}$ to the nonseed vertices, and using the fact that permutation matrices are unitary, the seeded graph matching problem is equivalent (i.e., has the same minimizer) to
\[ \max_{P' \in \Pi(u)} \Big[ \operatorname{tr}\!\big( A_{22} P' B_{22} P'^{T} \big) + 2 \operatorname{tr}\!\big( A_{21}^{T} P' B_{21} \big) \Big]. \]
Thus, we can recast (2) as a seeded graph matching problem so that finding
\[ \hat{b} \in \hat{\mathcal{B}} \]
is equivalent to finding

\[ P^{*} \in \operatorname*{arg\,max}_{P' \in \Pi(u)} \Big[ \operatorname{tr}\!\big( A_{22} P' B_{22} P'^{T} \big) + 2 \operatorname{tr}\!\big( A_{21}^{T} P' B_{21} \big) \Big], \tag{5} \]

as we shall explain below.

With $B$ defined as in (4), we define
\[ \Psi = \operatorname*{arg\,max}_{P' \in \Pi(u)} \Big[ \operatorname{tr}\!\big( A_{22} P' B_{22} P'^{T} \big) + 2 \operatorname{tr}\!\big( A_{21}^{T} P' B_{21} \big) \Big], \]
the set of solutions of (5). Define an equivalence relation $\equiv$ on $\Pi(u)$ via $P_1 \equiv P_2$ iff there exists a $Q \in \Pi(u)$ with $Q B_{22} Q^{T} = B_{22}$ and $Q B_{21} = B_{21}$ such that $P_1 = P_2 Q$; i.e.,
$P_1 \equiv P_2$ iff $P_1$ and $P_2$ differ only by a permutation of nonseed vertices within blocks. Let $\Psi/\!\equiv$ denote the set of equivalence classes of $\Psi$ under equivalence relation $\equiv$. Solving (2) is equivalent to solving (5) in that there is a one-to-one correspondence between $\hat{\mathcal{B}}$ and $\Psi/\!\equiv$: for each $\phi \in \hat{\mathcal{B}}$ there is a unique $[P] \in \Psi/\!\equiv$ (with associated permutation $\sigma_P$) such that $\phi = b \circ \sigma_P$ on $U$; and for each $[P] \in \Psi/\!\equiv$ (with the permutation associated with $P$ given by $\sigma_P$), it holds that $b \circ \sigma_P \in \hat{\mathcal{B}}$.

2.1 The ML Vertex Nomination Scheme

The maximum likelihood (ML) vertex nomination scheme proceeds as follows. First, the SGM algorithm (Fishkind et al., 2012; Lyzinski et al., 2014a) is used to approximately find an element of $\Psi/\!\equiv$, which we shall denote by $[P^{*}]$. Let the corresponding element of $\hat{\mathcal{B}}$ be denoted by $\hat{b}$. For any $u, v \in U$ such that $\hat{b}(u) \neq \hat{b}(v)$, define $\hat{b}^{(u,v)}$ as
\[ \hat{b}^{(u,v)}(w) = \begin{cases} \hat{b}(w) & \text{if } w \notin \{u, v\}, \\ \hat{b}(v) & \text{if } w = u, \\ \hat{b}(u) & \text{if } w = v; \end{cases} \]
i.e., $\hat{b}^{(u,v)}$ agrees with $\hat{b}$ except that $u$ and $v$ have their block memberships from $\hat{b}$ switched. For $v \in U$ such that $\hat{b}(v) = 1$, define
\[ g(v) = \max_{u \in U \,:\, \hat{b}(u) \neq 1} \frac{L\big(\hat{b}^{(u,v)}\big)}{L\big(\hat{b}\big)}, \]
where, for each $\phi \in \mathcal{B}$, the likelihood $L(\phi)$ is given by
\[ L(\phi) = \prod_{\{u,v\} \in \binom{V}{2}} \Lambda_{\phi(u), \phi(v)}^{A_{u,v}} \big( 1 - \Lambda_{\phi(u), \phi(v)} \big)^{1 - A_{u,v}}. \]
A low/high value of $g(v)$ is a measure of our confidence that $v$ is/is not in the block of interest. For $v \in U$ such that $\hat{b}(v) \neq 1$, define
\[ h(v) = \frac{L\big(\hat{b}\big)}{\max_{u \in U \,:\, \hat{b}(u) = 1} L\big(\hat{b}^{(u,v)}\big)}. \]
A low/high value of $h(v)$ is a measure of our confidence that $v$ is/is not in the block of interest. We are now ready to define the maximum-likelihood based nomination scheme: the vertices in $\{v \in U : \hat{b}(v) = 1\}$ are nominated first, in order of increasing $g(v)$, followed by the vertices in $\{v \in U : \hat{b}(v) \neq 1\}$, in order of increasing $h(v)$.

Note that in the event that an argmax above contains more than one element (i.e., in the event of ties), the order in which these elements are nominated should be taken to be uniformly random.
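The following Python sketch illustrates the ranking step just described, assuming block 0 (Python is 0-indexed) is the block of interest and working with log-likelihoods for numerical stability; it is an illustrative, brute-force rendering of the swap-based confidence scores, not an optimized implementation:

```python
import numpy as np

def log_lik(A, Lambda, block):
    """Log-likelihood of adjacency matrix A under block assignment `block`
    (length-n integer array), as in the likelihood L(phi) above."""
    P = Lambda[np.ix_(block, block)]
    iu = np.triu_indices_from(A, k=1)
    a, p = A[iu], P[iu]
    return np.sum(a * np.log(p) + (1 - a) * np.log(1 - p))

def ml_nominate(A, Lambda, block_hat, nonseeds):
    """Brute-force rendering of the swap-based ranking step.

    Vertices currently assigned to the block of interest (label 0 here)
    come first, ordered so that vertices whose removal from the block
    would cost the most likelihood are ranked highest."""
    base = log_lik(A, Lambda, block_hat)
    score = {}
    for v in nonseeds:
        in_block = block_hat[v] == 0
        swaps = [u for u in nonseeds if (block_hat[u] == 0) != in_block]
        best = -np.inf  # best log-likelihood ratio over admissible swaps
        for u in swaps:
            b = block_hat.copy()
            b[u], b[v] = b[v], b[u]  # switch the two block memberships
            best = max(best, log_lik(A, Lambda, b) - base)
        # low score = more confident that v is in the block of interest
        score[v] = best if in_block else -best
    ranked = sorted([v for v in nonseeds if block_hat[v] == 0], key=score.get)
    others = sorted([v for v in nonseeds if block_hat[v] != 0], key=score.get)
    return ranked + others
```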

Remark 5.

In the event that $\Lambda$ is unknown a priori, we can use the block memberships of the seeds (assumed to be chosen uniformly at random from $V$) to estimate the edge probability matrix as
\[ \widehat{\Lambda}_{k,k} = \frac{\sum_{\{u,v\} \in \binom{S_k}{2}} A_{u,v}}{\binom{m_k}{2}} \quad \text{for } k \in [K], \]
and
\[ \widehat{\Lambda}_{k,\ell} = \frac{\sum_{u \in S_k,\, v \in S_\ell} A_{u,v}}{m_k\, m_\ell} \quad \text{for } k \neq \ell \in [K]. \]
The plug-in estimate of $B$, given by
\[ \widehat{B}_{u,v} = \log\!\left( \frac{\widehat{\Lambda}_{b(u), b(v)}}{1 - \widehat{\Lambda}_{b(u), b(v)}} \right), \]
can then be used in place of $B$ in Eq. (5). If, in addition, $\vec{n}$ is unknown, we can estimate the block sizes as
\[ \widehat{n}_k = \frac{n \cdot m_k}{m} \]
for each $k \in [K]$, and these estimates can be used to determine the block sizes in $\mathcal{B}$.
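A minimal sketch of the plug-in estimation in Remark 5, assuming each block contains at least two seeds (function name and interface are ours):

```python
import numpy as np

def estimate_params(A, seed_idx, seed_blocks, K, n):
    """Plug-in estimates of Lambda and the block sizes from seeds,
    following Remark 5.  Assumes every block has at least two seeds.

    A           : (n, n) adjacency matrix.
    seed_idx    : length-m integer array of seed vertex indices.
    seed_blocks : length-m array of their block labels in {0, ..., K-1}."""
    m = len(seed_idx)
    Lambda_hat = np.zeros((K, K))
    for k in range(K):
        for l in range(k, K):
            rows = seed_idx[seed_blocks == k]
            cols = seed_idx[seed_blocks == l]
            if k == l:
                iu = np.triu_indices(len(rows), k=1)
                Lambda_hat[k, k] = A[np.ix_(rows, rows)][iu].mean()
            else:
                Lambda_hat[k, l] = Lambda_hat[l, k] = A[np.ix_(rows, cols)].mean()
    # block sizes estimated proportionally to the seed class counts;
    # note the rounded sizes need not sum exactly to n
    n_hat = np.round(n * np.bincount(seed_blocks, minlength=K) / m).astype(int)
    return Lambda_hat, n_hat
```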

2.2 The Restricted-Focus ML Vertex Nomination Scheme

Graph matching is a computationally difficult problem, and there are no known polynomial time algorithms for solving the general graph matching problem for simple graphs. Furthermore, if the graphs are allowed to be weighted, directed, and loopy, then graph matching is equivalent to the NP-hard quadratic assignment problem. While there are numerous efficient, approximate graph matching algorithms (see, for example, Vogelstein et al., 2014; Fishkind et al., 2012; Zaslavskiy et al., 2009; Fiori et al., 2013, and the references therein), these algorithms often lack performance guarantees.

Inspired by the restricted-focus seeded graph matching problem considered in Lyzinski et al. (2014a), we now define the computationally tractable restricted-focus likelihood maximization vertex nomination scheme. Rather than attempting to quickly approximate a solution to the full graph matching problem as in Vogelstein et al. (2014); Fishkind et al. (2012); Zaslavskiy et al. (2009); Fiori et al. (2013), this approach simplifies the problem by ignoring the edges between unseeded vertices. An analogous restriction for matching simple graphs was introduced in Lyzinski et al. (2014a). We begin by considering the graph matching problem in Eq. (5). The objective function
\[ \operatorname{tr}\!\big( A_{22} P' B_{22} P'^{T} \big) + 2 \operatorname{tr}\!\big( A_{21}^{T} P' B_{21} \big) \]
consists of two terms: $\operatorname{tr}( A_{22} P' B_{22} P'^{T} )$, which seeks to align the induced subgraphs on the nonseed vertices; and $2 \operatorname{tr}( A_{21}^{T} P' B_{21} )$, which seeks to align the induced bipartite subgraphs between the seed and nonseed vertices. While the graph matching objective function, Eq. (5), is quadratic in $P'$, restricting our focus to the second term in Eq. (5) yields the following linear assignment problem

\[ P_{R}^{*} \in \operatorname*{arg\,max}_{P' \in \Pi(u)} \operatorname{tr}\!\big( A_{21}^{T} P' B_{21} \big), \tag{6} \]

which can be efficiently and exactly solved in $O(u^3)$ time with the Hungarian algorithm (Kuhn, 1955; Jonker and Volgenant, 1987). We note that, exactly as was the case for (2) and (5), finding $P_{R}^{*}$ is equivalent to finding
\[ \hat{b}_R \in \hat{\mathcal{B}}_R := \operatorname*{arg\,max}_{\phi \in \mathcal{B}} \sum_{u \in S,\, v \in U} \Big[ A_{u,v} \log \Lambda_{\phi(u), \phi(v)} + (1 - A_{u,v}) \log\big( 1 - \Lambda_{\phi(u), \phi(v)} \big) \Big], \]
in that there is a one-to-one correspondence between $\hat{\mathcal{B}}_R$ and the set of equivalence classes of solutions of (6) under $\equiv$.
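Since (6) is a linear assignment problem, it can be solved exactly with off-the-shelf solvers. A minimal sketch using SciPy's Hungarian-style solver (the reduction of the trace objective to a score matrix is spelled out in the comments):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def restricted_focus_match(A21, B21):
    """Exactly solve the linear assignment problem (6):
    maximize trace(A21^T P' B21) over permutation matrices P'.

    Expanding the trace gives
        trace(A21^T P' B21) = sum_{i,j} P'[i, j] * (A21 @ B21.T)[i, j],
    so the problem is a linear assignment with score matrix C = A21 @ B21.T.

    A21 : (u, m) adjacency block between nonseed and seed vertices.
    B21 : (u, m) corresponding block of the log-odds matrix B.
    Returns sigma, with sigma[i] = j meaning nonseed vertex i is matched
    to the j-th nonseed position of the model graph."""
    C = A21 @ B21.T
    rows, cols = linear_sum_assignment(-C)  # SciPy minimizes, so negate
    return cols[np.argsort(rows)]
```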

The restricted-focus scheme proceeds as follows. First, the linear assignment problem, Eq. (6), is exactly solved using, for example, the Hungarian algorithm (Kuhn, 1955) or the path augmenting algorithm of Jonker and Volgenant (1987), yielding $P_{R}^{*}$. Let the corresponding element of $\hat{\mathcal{B}}_R$ be denoted by $\hat{b}_R$. For $v \in U$ such that $\hat{b}_R(v) = 1$, define
\[ g_R(v) = \max_{u \in U \,:\, \hat{b}_R(u) \neq 1} \frac{L_R\big(\hat{b}_R^{(u,v)}\big)}{L_R\big(\hat{b}_R\big)}, \]
where, for each $\phi \in \mathcal{B}$, the restricted likelihood is defined via
\[ L_R(\phi) = \prod_{u \in S,\, v \in U} \Lambda_{\phi(u), \phi(v)}^{A_{u,v}} \big( 1 - \Lambda_{\phi(u), \phi(v)} \big)^{1 - A_{u,v}}. \]
As with the ML scheme, a low/high value of $g_R(v)$ is a measure of our confidence that $v$ is/is not in the block of interest. For $v \in U$ such that $\hat{b}_R(v) \neq 1$, define
\[ h_R(v) = \frac{L_R\big(\hat{b}_R\big)}{\max_{u \in U \,:\, \hat{b}_R(u) = 1} L_R\big(\hat{b}_R^{(u,v)}\big)}. \]
As before, a low/high value of $h_R(v)$ is a measure of our confidence that $v$ is/is not in the block of interest. We are now ready to define the restricted-focus ML nomination scheme: the vertices in $\{v \in U : \hat{b}_R(v) = 1\}$ are nominated first, in order of increasing $g_R(v)$, followed by the vertices in $\{v \in U : \hat{b}_R(v) \neq 1\}$, in order of increasing $h_R(v)$.

Note that, as before, in the event that an argmax above contains more than one element, the order in which these elements are nominated should be taken to be uniformly random.

Unlike the full ML scheme, the restricted-focus scheme is feasible even for comparatively large graphs (up to thousands of nodes, in our experience). However, we will see in Section 6 that the extra information available to the full ML scheme—the adjacency structure among the nonseed vertices—leads to superior precision in the nomination lists as compared to the restricted-focus scheme. We next turn our attention to proving the consistency of the ML and restricted-focus ML schemes.

3 Consistency of the ML and Restricted-Focus ML Schemes

In this section, we state theorems ensuring the consistency of the ML vertex nomination scheme (Theorem 6) and of the restricted-focus ML scheme (Theorem 8). For the sake of expository continuity, proofs are given in the Appendix. We note here that in these theorems, the parameters of the underlying block model are assumed to be known a priori. In Section 4, we prove the consistency of both schemes in the setting where the model parameters are unknown and must be estimated, as in Remark 5.

Let $G \sim \mathrm{SBM}(K, \vec{n}, b, \Lambda)$ with associated adjacency matrix $A$, and let $B$ be defined as in (4). For each $P \in \Pi(u)$ (with associated permutation $\sigma_P$) and $k, \ell \in [K]$, define
\[ c_{k,\ell}(P) = \big| \{ v \in U_k : \sigma_P(v) \in U_\ell \} \big| \]
to be the number of vertices in $U_k$ mapped to $U_\ell$ by $\sigma_P$, and for each $P \in \Pi(u)$ define
\[ c(P) = \sum_{k \neq \ell} c_{k,\ell}(P), \]
the total number of nonseed vertices whose block membership is in error under $\sigma_P$.

Before stating and proving the consistency of the ML scheme, we first establish some necessary notation. Note that in the definitions and theorems presented next, all values implicitly depend on $n$, as $n$ is allowed to vary in $\mathbb{N}$. Let $\{\lambda_1, \lambda_2, \dots, \lambda_r\}$ be the set of distinct entries of $\Lambda$, and define

\[ \alpha := \min_{i \neq j} |\lambda_i - \lambda_j|, \tag{7} \]
\[ \beta := \min_{i} \min(\lambda_i,\, 1 - \lambda_i). \tag{8} \]
Theorem 6.

Let $G \sim \mathrm{SBM}(K, \vec{n}, b, \Lambda)$ and assume that

  • $K$ is fixed in $n$;

  • $\Lambda$ is such that for all $k, \ell \in [K]$ with $k \neq \ell$, $\Lambda_{k, \cdot} \neq \Lambda_{\ell, \cdot}$;

  • For each $k \in [K]$, $n_k = \Theta(n)$, and $m_k = \Omega(\log n)$;

  • $\alpha$ and $\beta$, defined in (7) and (8), are fixed in $n$.

Then it holds that $\mathrm{AP} = 1$ a.a.a.s., and the ML scheme is a consistent nomination scheme.

A proof of Theorem 6 is given in the Appendix.

Remark 7.

There are numerous sets of assumptions akin to those in Theorem 6 under which we can show that the ML scheme is consistent. Essentially, we need to ensure that if we define $p_n$ to be the probability that the maximum likelihood estimate $\hat{b}$ fails to perfectly recover the block memberships of the nonseed vertices, then $p_n$ is summably small, from which it follows that $\mathrm{AP} = 1$ with high probability, which is enough to ensure the desired consistency of the ML scheme.

Consistency of the restricted-focus ML scheme holds under similar assumptions.

Theorem 8.

Let $G \sim \mathrm{SBM}(K, \vec{n}, b, \Lambda)$. Under the following assumptions:

  • $K$ is fixed in $n$;

  • $\Lambda$ is such that for all $k, \ell \in [K]$ with $k \neq \ell$, $\Lambda_{k, \cdot} \neq \Lambda_{\ell, \cdot}$;

  • For each $k \in [K]$, $n_k = \Theta(n)$, and $m_k = \omega(\log n)$;

  • $\alpha$ and $\beta$, defined in (7) and (8), are fixed in $n$;

it holds that $\mathrm{AP} = 1$ a.a.a.s., and the restricted-focus ML scheme is a consistent nomination scheme.

A proof of this Theorem can be found in the Appendix.

4 Consistency of the ML and Restricted-Focus ML Schemes When the Model Parameters are Unknown

If $\Lambda$ is unknown a priori, then the seeds can be used to estimate $\Lambda$ as $\widehat{\Lambda}$, and $n_k$ as $\widehat{n}_k = n\, m_k / m$ for each $k \in [K]$, as in Remark 5. In this section, we will prove analogues of the consistency Theorems 6 and 8 in the case where $\Lambda$ and $\vec{n}$ are estimated using seeds. In Theorems 9 and 10 below, we prove that under mild model assumptions, both the ML and the restricted-focus ML schemes are consistent vertex nomination schemes, even when the seed vertices form a vanishing fraction of the graph.

We now state the consistency result analogous to Theorem 6, this time for the case where we estimate $\Lambda$ and $\vec{n}$. The proof can be found in the Appendix.

Theorem 9.

Let $\Lambda \in [0,1]^{K \times K}$ be a fixed, symmetric block probability matrix satisfying:

  • $K$ is fixed in $n$;

  • $\Lambda$ is such that for all $k, \ell \in [K]$ with $k \neq \ell$, $\Lambda_{k, \cdot} \neq \Lambda_{\ell, \cdot}$;

  • For each $k \in [K]$, $n_k = \Theta(n)$ and $m_k = \omega(\log n)$;

  • $\alpha$ and $\beta$, defined as in (7) and (8), are fixed in $n$.

Suppose that the model parameters of $G \sim \mathrm{SBM}(K, \vec{n}, b, \Lambda)$ are estimated as in Remark 5, yielding log-odds matrix estimate $\widehat{B}$ and estimated block sizes $\widehat{\vec{n}}$. If the ML scheme is run on $A$ and $\widehat{B}$ using the block sizes given by $\widehat{\vec{n}}$, then under the above assumptions it holds that $\mathrm{AP} = 1$ a.a.a.s., and the ML scheme is a consistent nomination scheme.

We now state the analogous consistency result to Theorem 8 when we estimate $\Lambda$ and $\vec{n}$. The proof is given in the Appendix.

Theorem 10.

Let $\Lambda \in [0,1]^{K \times K}$ be a fixed, symmetric block probability matrix satisfying:

  • $K$ is fixed in $n$;

  • $\Lambda$ is such that for all $k, \ell \in [K]$ with $k \neq \ell$, $\Lambda_{k, \cdot} \neq \Lambda_{\ell, \cdot}$;

  • For each $k \in [K]$, $n_k = \Theta(n)$ and $m_k = \omega(\log n)$;

  • $m \to \infty$ and $m = o(n)$;

  • $\alpha$ and $\beta$, defined at (7) and (8), are fixed in $n$.

Suppose that the model parameters of $G \sim \mathrm{SBM}(K, \vec{n}, b, \Lambda)$ are estimated as in Remark 5, yielding $\widehat{B}$ and estimated block sizes $\widehat{\vec{n}}$. If the restricted-focus ML scheme is run on $A$ and $\widehat{B}$ using block sizes given by $\widehat{\vec{n}}$, then under the above assumptions it holds that $\mathrm{AP} = 1$ a.a.a.s., and the restricted-focus ML scheme is a consistent nomination scheme.

The two preceding theorems imply that vertex nomination is possible even when the number of seeds is a vanishing fraction of the vertices in the graph. Indeed, we find that in practice, accurate nomination is possible even with just a handful of seed vertices. See the experiments presented in Section 6.

5 Model Generalizations

Network data rarely appears in isolation. In the vast majority of use cases, the observed graph is richly annotated with information about the vertices and edges of the network. For example, in a social network, in addition to information about which users are friends, we may have vertex-level information in the form of age, education level, hobbies, etc. Similarly, in many networks, not all edges are created equal. Edge weights may encode the strength of a relation, such as the volume of trade between two countries. In this section, we sketch how the ML and restricted-focus ML vertex nomination schemes can be extended to such annotated networks by incorporating edge weights and vertex features. To wit, all of the theorems proven above translate mutatis mutandis to the setting in which $G$ is drawn from a bounded canonical exponential family stochastic block model. Consider a single-parameter exponential family of distributions whose density can be expressed in canonical form as
\[ f(x \mid \theta) = h(x) \exp\big( \theta x - \zeta(\theta) \big). \]
We will further assume that $f$ has bounded support. We define

Definition 11.

A $\mathcal{G}_n$-valued random graph $G$ is an instantiation of a bounded, canonical exponential family stochastic block model, written $G \sim \mathrm{SBM}(K, \vec{n}, b, \Theta)$, if

  • The vertex set $V$ is partitioned into $K$ blocks $V_1, V_2, \dots, V_K$, with sizes $|V_k| = n_k$ for $k \in [K]$;

  • The block membership function $b : [n] \to [K]$ is such that for each $v \in V_k$, $b(v) = k$;

  • The symmetric block parameter matrix $\Theta \in \mathbb{R}^{K \times K}$ is such that the edge weights $A_{u,v}$, $\{u, v\} \in \binom{V}{2}$, are independent, distributed according to the density $f(\cdot \mid \Theta_{b(u), b(v)})$.

Note that the exponential family density is usually written as $f(x \mid \theta) = h(x)\exp(\theta x - A(\theta))$, where $A(\cdot)$ is the log-normalization function. We have made the notational substitution $\zeta(\cdot) = A(\cdot)$ to avoid confusion with the adjacency matrix $A$. If $G \sim \mathrm{SBM}(K, \vec{n}, b, \Theta)$, analogues to Theorems 6, 8, 9, and 10 follow mutatis mutandis if we use seeded graph matching to match $A$ to the natural parameter matrix $B$ given by $B_{u,v} = \Theta_{b(u), b(v)}$ (in the Bernoulli setting of Definition 1, this is precisely the log-odds matrix of Eq. (4)); i.e., under analogous model assumptions, the ML and restricted-focus ML schemes are both consistent vertex nomination schemes when the model parameters are known or estimated via seeds. The key property being exploited here is that the log-likelihood $\log f(x \mid \theta)$ is, up to an additive term not depending on $\theta$, a nondecreasing function of $\theta x$. We expect that results analogous to Theorems 6, 8, 9, and 10 can be shown to hold for more general weight distributions as well, but we do not pursue this further here.
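As a sanity check on this generalization, one can verify that the Bernoulli edge distribution of Definition 1 is itself a bounded canonical exponential family whose natural parameter is exactly the log-odds appearing in Eq. (4):
\begin{align*}
f(x \mid \lambda) &= \lambda^{x}(1-\lambda)^{1-x} \qquad (x \in \{0,1\}) \\
&= \exp\!\left( x \log\frac{\lambda}{1-\lambda} + \log(1-\lambda) \right) = h(x)\,\exp\big(\theta x - \zeta(\theta)\big),
\end{align*}
with $h(x) = 1$, natural parameter $\theta = \log\frac{\lambda}{1-\lambda}$, and log-normalizer $\zeta(\theta) = \log(1 + e^{\theta}) = -\log(1-\lambda)$.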

Incorporating vertex features into the ML and restricted-focus ML schemes is immediate. Suppose that each vertex $v \in V$ is accompanied by a $d$-dimensional feature vector $x_v \in \mathbb{R}^d$. The features could encode additional information about the community structure of the underlying network; for example, if $d = 1$ then perhaps $x_v \sim \mathrm{Normal}(\mu_{b(v)}, \sigma^2_{b(v)})$, where the parameters of the normal distribution vary across blocks and are constant within blocks. This setup, in which vertices are “annotated” or “attributed” with additional information, is quite common. Indeed, in almost all use cases, some auxiliary information about the graph is available, and methods that can leverage this auxiliary information are crucial. See, for example, Yang et al. (2013); Zhang et al. (2015); Newman and Clauset (2016); Franke and Wolfe (2016) and citations therein. We model vertex features as follows. Conditioning on $b(v) = k$, the feature $x_v$ associated to $v$ is drawn, independently of $G$ and of all other features $\{x_w\}_{w \neq v}$, from a distribution with density $f_k$. Define the feature matrix $X \in \mathbb{R}^{n \times d}$ via
\[ X = \begin{pmatrix} X_S \\ X_U \end{pmatrix}, \]
where $X_S \in \mathbb{R}^{m \times d}$ represents the features of the seed vertices in $G$, and $X_U \in \mathbb{R}^{u \times d}$ the features of the nonseed vertices. For each block $k \in [K]$, let $\hat{f}_k$ be an estimate of the density $f_k$, and create the matrix $F \in \mathbb{R}^{u \times u}$ given by
\[ F_{i,j} = \log \hat{f}_{b(j)}\big( (X_U)_i \big), \]
where $b(j)$ denotes the (fixed) block membership of the $j$-th nonseed position in the model graph $B$. Then we can incorporate the feature density into the seeded graph matching problem in (5) by adding a linear factor to the quadratic assignment problem:

\[ \max_{P' \in \Pi(u)} \Big[ \operatorname{tr}\!\big( A_{22} P' B_{22} P'^{T} \big) + 2 \operatorname{tr}\!\big( A_{21}^{T} P' B_{21} \big) + \gamma \operatorname{tr}\!\big( F P'^{T} \big) \Big]. \tag{9} \]

The factor $\gamma \geq 0$ allows us to weight the features encapsulated in $F$ versus the information encoded into the network topology of $G$.
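A minimal sketch of assembling the feature term and evaluating the augmented objective (9); the matrix layout and helper names follow our reconstruction above and are illustrative:

```python
import numpy as np

def feature_term(X_U, position_blocks, f_hat):
    """Linear feature term for the augmented objective (9).

    X_U             : (u, d) features of the nonseed vertices.
    position_blocks : length-u array; the (fixed) block label of each
                      nonseed position in the model graph B.
    f_hat           : list of K callables; f_hat[k](x) estimates f_k(x).
    Returns F with F[i, j] = log f_hat[position_blocks[j]](X_U[i])."""
    u = len(X_U)
    F = np.empty((u, u))
    for j in range(u):
        fk = f_hat[position_blocks[j]]
        F[:, j] = [np.log(fk(x)) for x in X_U]
    return F

def augmented_objective(A22, B22, A21, B21, F, P, gamma):
    """tr(A22 P B22 P^T) + 2 tr(A21^T P B21) + gamma * tr(F P^T)."""
    return (np.trace(A22 @ P @ B22 @ P.T)
            + 2.0 * np.trace(A21.T @ P @ B21)
            + gamma * np.trace(F @ P.T))
```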

Vertex nomination then proceeds as follows. First, the SGM algorithm (Fishkind et al., 2012; Lyzinski et al., 2014a) is used to approximately find an element of the solution set of Eq. (9), which we shall denote by $P^{*}_{F}$. Let the block membership function corresponding to $P^{*}_{F}$ be denoted $\hat{b}_F$. For $v \in U$ such that $\hat{b}_F(v) = 1$, define
\[ g_F(v) = \max_{u \in U \,:\, \hat{b}_F(u) \neq 1} \frac{L_F\big(\hat{b}_F^{(u,v)}\big)}{L_F\big(\hat{b}_F\big)}, \]
where, for each $\phi \in \mathcal{B}$, the likelihood is given by
\[ L_F(\phi) = \prod_{\{u,w\} \in \binom{V}{2}} \Lambda_{\phi(u), \phi(w)}^{A_{u,w}} \big( 1 - \Lambda_{\phi(u), \phi(w)} \big)^{1 - A_{u,w}} \prod_{v \in U} \hat{f}_{\phi(v)}(x_v), \]
where, for $k \in [K]$, $\hat{f}_k$ is the estimated density of the $k$-th block's features. Note that here we assume that the feature densities must be estimated, even when the matrix $\Lambda$ is known. A low/high value of $g_F(v)$ is a measure of our confidence that $v$ is/is not in the block of interest. For $v \in U$ such that $\hat{b}_F(v) \neq 1$, define
\[ h_F(v) = \frac{L_F\big(\hat{b}_F\big)}{\max_{u \in U \,:\, \hat{b}_F(u) = 1} L_F\big(\hat{b}_F^{(u,v)}\big)}. \]
A low/high value of $h_F(v)$ is a measure of our confidence that $v$ is/is not in the block of interest. The nomination list is then realized exactly as before: the vertices in $\{v \in U : \hat{b}_F(v) = 1\}$ are nominated first, in order of increasing $g_F(v)$, followed by the vertices in $\{v \in U : \hat{b}_F(v) \neq 1\}$, in order of increasing $h_F(v)$.