# Graphical Models: An Extension to Random Graphs, Trees, and Other Objects

## Abstract

In this work, we consider an extension of graphical models to random graphs, trees, and other objects. To do this, many fundamental concepts for multivariate random variables (e.g., marginal variables, Gibbs distribution, Markov properties) must be extended to other mathematical objects; it turns out that this extension is possible, as we will discuss, if we have a consistent, complete system of projections on a given object. Each projection defines a marginal random variable, allowing one to specify independence assumptions between them. Furthermore, these independencies can be specified in terms of a small subset of these marginal variables (which we call the atomic variables), allowing the compact representation of independencies by a directed graph. Projections also define factors, functions on the projected object space, and hence a projection family defines a set of possible factorizations for a distribution; these can be compactly represented by an undirected graph.

The invariances used in graphical models are essential for learning distributions, not just on multivariate random variables, but also on other objects. When they are applied to random graphs and random trees, the result is a general class of models that is applicable to a broad range of problems, including those in which the graphs and trees have complicated edge structures. These models need not be conditioned on a fixed number of vertices, as is often the case in the literature for random graphs, and can be used for problems in which attributes are associated with vertices and edges. For graphs, applications include the modeling of molecules, neural networks, and relational real-world scenes; for trees, applications include the modeling of infectious diseases and their spread, cell fusion, the structure of language, and the structure of objects in visual scenes. Many classic models can be seen to be particular instances of this framework.

## 1 Introduction

In problems involving the statistical modeling of a collection of random variables (i.e., a multivariate random variable), the use of invariance assumptions is often critical for practical learning and inference. A graphical model is a framework for such problems based on conditional independence, a fundamental invariance for these variables; this framework has found widespread use because independence occurs naturally in many problems, and is often specifiable by practitioners. Furthermore, independence assumptions can be made to varying degrees (for many invariances, this is not the case), thus creating a range of model complexities, and allowing practitioners to adjust models to a given problem.

In this work, we consider an extension of graphical models from multivariate random variables to other random objects such as random graphs and trees. To do this, core concepts from graphical models must be abstracted, forming a more general formulation; in this formulation, graphical models can be applied to any object that has, loosely speaking, a structure allowing a hierarchical family of projections on it. Each projection in this family defines a marginal random variable, allowing one to specify independence assumptions between them, and further, allowing a graph to represent these independencies (where vertices correspond to atomic variables). This projection family also defines, for distributions, a family of factors, allowing one to specify general factorizations, and further, also represent them compactly by a graph. A projection family must satisfy certain basic properties in order for the corresponding variables to be consistent with each other.

In the first part of this work, we examine models for random graphs, the problem that originally motivated this investigation. Applying graphical models to them results in a general framework, applicable to problems in which graphs have complicated edge structures. These models need not be conditioned on a fixed number of vertices, as is often the case in the literature, and can be used for problems in which graphs have attributes associated with their vertices and edges. The focus of this work is on problems in which the number of vertices can vary. Some examples of graphs that these models are applicable to are shown in Figures 6 and 7. This work makes no contribution to the traditional setting of random graphs in which the vertex set is fixed; the formulation presented here is unnecessary in that setting.

After investigating graphical models for graphs, we consider their application to trees, a special type of graph used in many real-world problems. As with graphs, this results in models applicable to a broad range of problems, including those in which trees have complex structures and attributes. In the approach taken in most of the literature, probabilities are placed on trees based on how a tree is incrementally constructed (e.g., from a branching process or grammar). Using graphical models, this approach may be extended, allowing distributions to be defined based on how trees are deconstructed into parts. The benefit of this graphical model approach is that one can make well-defined distributions that have complex dependencies; in contrast, it is often intractable to define distributions over, for example, context-sensitive grammars.

In the last part of this work, we define some consistency and completeness conditions for projection families. These conditions on projections ensure the consistency of their corresponding random variables (i.e., they form a family of marginal variables), which in turn allows graphical models to be directly defined in terms of projection families. In this formulation, graphical models may be loosely thought of as a modeling framework based on independence assumptions between the parts of an object, given that the object is compositional. An object is compositional if: (a) it is composed of parts, which in turn are themselves composed of parts, etc.; and (b) a part can be a member of multiple larger parts. Objects such as vectors, graphs, and trees are compositional; in more applied settings, objects such as words and sentences, people, and real-world scenes are compositional as well. Graphical models are naturally suited to the modeling of these objects.

### 1.1 Random Graphs

A graph is a mathematical object that is able to encode relational information, and can be used to represent many entities in the world such as molecules, neural networks, and real-world scenes. An (undirected) graph is composed of a finite set of objects called vertices, and for each pair of vertices, specifies a binary value. If this binary value is positive, there is said to be an edge between that pair of vertices. In most applications, graphs have attributes associated with their vertices and edges; we will refer to attributed graphs simply as graphs in this work. (We make more formal definitions in Section 2.) A random graph is a random variable that maps into a set of graphs. In this section, we give a brief overview of random graph models in the literature, and discuss some of their shortcomings, motivating our work.

#### Literature

The most commonly studied random graph model is the Erdős-Rényi model ([17], [25]). This is a model for conditional distributions in which, for a given set of vertices, a distribution is placed over the possible edges. It makes the invariance assumption that, for any two vertices, the probability of an edge between them is independent of the other edges in the graph, and further, this probability is the same for all edges. This classic model, due to its simplicity, is conducive to mathematical analysis; its asymptotic behavior (i.e., its behavior as the number of vertices becomes large) has been researched extensively ([5], [40]).
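As an illustration, the following sketch samples an edge set under these two assumptions for a fixed vertex set. The function name `sample_erdos_renyi`, the parameter `p`, and the representation of edges as ordered pairs are our own choices for this sketch, not notation from the cited works.

```python
import itertools
import random


def sample_erdos_renyi(vertices, p, rng=None):
    """Sample edges over a fixed vertex set.

    Each unordered pair of vertices becomes an edge independently,
    with the same probability p for every pair.
    """
    rng = rng or random.Random()
    edges = set()
    for u, v in itertools.combinations(sorted(vertices), 2):
        if rng.random() < p:
            edges.add((u, v))
    return edges


# Example: one sample over five vertices (seeded for reproducibility).
edges = sample_erdos_renyi(range(5), p=0.3, rng=random.Random(0))
```

Note that the vertex set is an input rather than something sampled, which reflects the conditional nature of the model.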

There are many ways in which the Erdős-Rényi model can be extended. One such extension is the stochastic blockmodel [38]. This model is for conditional distributions over the edges, given vertices, where each vertex has a label (e.g., a color) associated with it. Similar to the Erdős-Rényi model, for any two vertices, the probability of an edge between them is independent of the other edges in the graph; unlike the Erdős-Rényi model, this probability depends on the labels of those two vertices.
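A minimal sketch of sampling from a stochastic blockmodel is given below. The names `sample_blockmodel`, `labels`, and `edge_prob`, and the convention of indexing edge probabilities by the sorted pair of labels, are assumptions made for this illustration only.

```python
import itertools
import random


def sample_blockmodel(labels, edge_prob, rng=None):
    """Sample edges given vertex labels.

    labels:    dict mapping each vertex to its label.
    edge_prob: dict mapping a sorted pair of labels to an edge probability.
    Edges are sampled independently; the probability depends only on
    the labels of the two endpoints.
    """
    rng = rng or random.Random()
    edges = set()
    for u, v in itertools.combinations(sorted(labels), 2):
        key = tuple(sorted((labels[u], labels[v])))
        if rng.random() < edge_prob[key]:
            edges.add((u, v))
    return edges


# Example: two blocks that are densely connected within and sparse between.
labels = {0: "r", 1: "r", 2: "b", 3: "b"}
edge_prob = {("r", "r"): 0.9, ("b", "r"): 0.1, ("b", "b"): 0.9}
edges = sample_blockmodel(labels, edge_prob, rng=random.Random(0))
```

Setting all entries of `edge_prob` to the same value recovers the Erdős-Rényi assumptions.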

An extension of the stochastic blockmodel is the mixed membership stochastic blockmodel [1]. In this model, instead of associating each vertex with a fixed label, each vertex is associated with a probability vector over the possible labels. Given a set of vertices (and their label probability vectors), a set of edges can be sampled as follows: for each pair of vertices, first sample their respective labels, then sample from a Bernoulli distribution that depends on these labels. Another extension of the stochastic blockmodel is the latent space model [36], where instead of associating vertices with labels from a finite set, they are instead associated with positions in a Euclidean space; given the position of two vertices, the probability of an edge between them only depends on their distance.

A general class of random graph models, within which the above models fall, is the exponential family ([37], [53], [61]). A well-known example is the Frank and Strauss model [19], also a model for conditional distributions, specifying the probability of having some set of edges, given vertices. Since the randomness is only over the edges, a graphical model can be applied in which there is a random variable for each pair of vertices, specifying the presence or absence of an edge. These random variables are conditionally independent, in this model, if they do not share a common vertex.

#### Other Literature

In this section, we review models from outside the mainstream random graph community that were designed for graphs that vary in size and have complicated attributes. One of the first such models was developed by Ulf Grenander under the name pattern theory ([29], [27], [28]). This work was motivated by the desire to formalize the concept of a pattern within a mathematical framework. A large collection of natural and man-made patterns is shown in [26]. Examples range from textures to leaf shapes to human language. In each of these examples, every particular instance of the given pattern can be represented by a graph. These instances have natural variations, and so the mathematical framework for describing these patterns is probabilistic, i.e. a random graph model. The model developed was based on applying Markov random fields to graphs. Learning and inference are often difficult in this model, limiting its practical use.

Later, random graph models were developed within the field of statistical relational learning. In particular, techniques such as Probabilistic Relational Models [24], Relational Markov Networks [63], and Probabilistic Entity-Relationship Models [35] were specifically designed for modeling entities that are representable as graphs. These models specify conditional distributions, applying graphical models in which: (1) for each vertex, there is a random variable representing its attributes; and (2) for each pair of vertices, there is a random variable representing their edge attributes. (This is an approach similar to the one taken in the Frank and Strauss model.)

#### Issues

Suppose we want to learn a distribution over some graph space. This distribution cannot be directly modeled with graphical models because these were designed for multivariate random variables (with a fixed number of components). To avoid this issue, most random graph models in the literature transform the problem into one in which graphical models can be applied. This is done by only modeling a selected set of conditional distributions, for example, the set of distributions in which each is conditioned on some number of vertices. Aside from the fact that many applications simply require full distributions, problems with this approach include: (1) there are complicated consistency issues; a distribution may not exist that could produce a given set of conditional distributions; and (2) this partial modeling, loosely speaking, cannot capture important structures in distributions (e.g., there may be invariances within a full distribution that are difficult to encode within conditional distributions). To correct these issues, graphical models (for multivariate random variables) cannot be used for this problem; we need statistical models specifically designed for general graph spaces. Suppose we have a graph space in which graphs may differ in their order (i.e., graphs in this space may vary in their number of vertices); in this work, we want to develop distributions over this type of space.

In addition, we want models that are applicable to problems in which: (a) graphs have complex edge structures; and (b) graphs have attributes associated with their vertices and edges. To handle these problems, expressive models are necessary (i.e., models containing a large set of distributions). To make learning feasible in these models, it becomes imperative to specify structure in them as well.

#### Structure

To specify structure in random graph distributions, we look to the standard methods used in multivariate random variables for insight. Suppose we have a random variable taking values in . In general, its distribution has parameters that need to be specified. If the value of is not small, learning this number of parameters, in most real-world problems, is infeasible; hence, the need to control complexity. This has led to the widespread use of graphical models ([46], [42], [65]), a framework that uses factorization to simplify distributions. In this framework, joint distributions are specified by simpler functions, and more specifically, the probability of any is uniquely determined by a product of functions over subsets of its components.
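To make the complexity argument concrete, the following sketch counts free parameters under the assumption that each component takes finitely many values. The function names and the chain-structured factorization used as a comparison are illustrative choices of ours; a directed factorization is used here only because its parameter count is easy to state.

```python
from math import prod


def full_joint_parameters(cardinalities):
    """Free parameters of an unrestricted joint distribution:
    one probability per joint configuration, minus one for normalization."""
    return prod(cardinalities) - 1


def factored_parameters(cardinalities, parents):
    """Free parameters when each component i has a conditional table
    given the components listed in parents[i]."""
    total = 0
    for i, k in enumerate(cardinalities):
        parent_configs = prod(cardinalities[j] for j in parents[i])
        total += (k - 1) * parent_configs
    return total


# Ten binary components: unrestricted joint versus a chain 0 -> 1 -> ... -> 9.
cards = [2] * 10
chain = {i: ([] if i == 0 else [i - 1]) for i in range(10)}
n_full = full_joint_parameters(cards)         # 2**10 - 1
n_chain = factored_parameters(cards, chain)   # 1 + 9 * 2
```

The unrestricted count grows exponentially with the number of components, whereas the chain count grows linearly.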

Now, suppose we have a random graph taking values in some finite graph space . In general, its distribution has parameters that need to be specified, and again, clearly there is a need to control complexity. Similar to the above graphical models, we can simplify distributions through the use of factorization: the probability of any graph can be uniquely determined by a product of simpler functions over its subgraphs. Thus, we can create a general framework for random graphs analogous to that of graphical models. Indeed, just as graphs can be used to represent the factorization in graphical models, graphs can also be used to represent the factorizations in random graphs. These ideas are explored in Section 2.

### 1.2 Random Trees

A tree is a special type of graph used in many real-world problems. Like graphs, random tree models range from simplistic ones, amenable to asymptotic analysis, to complex ones, more suited to problem domains with smaller, finite trees. We now briefly review models in the literature.

#### Literature

A classic random tree model is the Galton-Watson model ([66], [48], [14], [31]), where trees are incrementally constructed: beginning with the root vertex, the number of its children is sampled according to some distribution; for each of these vertices, the number of its children is sampled according to the same distribution, and so on. The literature on these models is vast, most focusing on the probability of extinction and the behavior in the limit. These models are often used, for example, in the study of extinction [32], and the spread of infectious diseases [7].
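The incremental construction can be sketched as follows. The nested-list tree representation, the names `sample_galton_watson` and `size`, and the `max_depth` truncation (added so that sampling always terminates) are assumptions of this sketch rather than part of the model itself.

```python
import random


def sample_galton_watson(offspring_probs, max_depth, rng=None):
    """Sample a tree in which every vertex draws its number of children
    from the same distribution.

    offspring_probs[k] is the probability of having k children.
    A tree is a nested list: each vertex is the list of its child subtrees.
    Vertices at depth max_depth are not expanded (truncation assumption).
    """
    rng = rng or random.Random()
    counts = list(range(len(offspring_probs)))

    def grow(depth):
        if depth >= max_depth:
            return []
        k = rng.choices(counts, weights=offspring_probs)[0]
        return [grow(depth + 1) for _ in range(k)]

    return grow(0)


def size(tree):
    """Number of vertices in a nested-list tree."""
    return 1 + sum(size(child) for child in tree)


tree = sample_galton_watson([0.3, 0.4, 0.3], max_depth=6, rng=random.Random(0))
```

A tree whose root draws zero children corresponds to immediate extinction.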

An extension of the Galton-Watson model is the multi-type Galton-Watson model ([57], [50]), in which each vertex now has a label from some finite set. As before, trees are incrementally constructed; for a given vertex, the number of its children and their labels are now sampled according to some conditional distribution (conditioned on the label of the parent).

In problems in which vertices have relatively complex labels, often a grammar is used to specify which trees are valid (i.e., used to define the tree space). These grammars produce trees by production rules, which may be thought of as functions that take a tree and specify a larger one; beginning with the empty tree, trees are incrementally built by the iterative application of these rules. In a context-free grammar [10], the production rules are functions that depend only on one (leaf) vertex in a given tree, and specify its children and their attributes. Distributions can be defined over trees in this grammar by associating a probability with each production rule.
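The following sketch illustrates a probabilistic context-free grammar: each production rule carries a probability, and the probability of a tree is the product of the probabilities of the rules used to build it. The toy grammar `RULES`, the start symbol `"S"`, and the tuple-based tree representation are hypothetical and chosen only for illustration; the sketch starts from a single start symbol rather than from the empty tree.

```python
import random

# Each nonterminal maps to a list of (right-hand side, probability) pairs.
RULES = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("dog",), 0.5), (("cat",), 0.5)],
    "VP": [(("sleeps",), 1.0)],
}


def sample_tree(symbol, rules, rng):
    """Expand a symbol recursively; symbols without rules are terminals."""
    if symbol not in rules:
        return symbol
    expansions, probs = zip(*rules[symbol])
    children = rng.choices(expansions, weights=probs)[0]
    return (symbol, [sample_tree(c, rules, rng) for c in children])


def tree_probability(tree, rules):
    """Product of the probabilities of all production rules used in the tree."""
    if isinstance(tree, str):
        return 1.0
    symbol, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = dict(rules[symbol])[rhs]
    for child in children:
        p *= tree_probability(child, rules)
    return p


tree = sample_tree("S", RULES, random.Random(0))
prob = tree_probability(tree, RULES)
```

Because each rule depends on a single vertex, the probability of a tree factors into independent per-vertex choices.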

A context-sensitive grammar is an extension of a context-free grammar in which production rules are functions that depend both on a given leaf vertex and on certain vertices neighboring this leaf. It is well known that the approach of associating probabilities with production rules does not extend to context-sensitive grammars (i.e., does not produce well-defined distributions in this case); to make distributions for these grammars, very high-order models are required.

There are many applications of random trees with attributes: in linguistics, they are used to describe the structure of sentences and words in natural language [10]; in computer vision, they are used to describe the structure of objects in scenes ([41], [69]); and in genetics, they are used in the study of structure in RNA molecules ([8], [15], [16]).

In this work, we consider a graphical model approach for random trees; by decomposing a random tree into its marginal random (tree) variables, it becomes tractable to make well-defined tree distributions that are, loosely speaking, context-sensitive. Since trees are graphs, one could model them by applying the same framework that we develop for general graphs. However, it is beneficial to instead use models that are tuned to the defining properties of trees.

### 1.3 Outline

In Section 2, we examine the common compositional structure within multivariate random variables and random graphs, allowing graphical models to be applied to each. The main ideas for extending graphical models to other objects are outlined in this section. In Section 3, we explore the modeling of random trees with graphical models. In Section 4, we provide a formulation for general random objects, and in Section 5, we illustrate the application of these models with some examples, focusing on random graphs. Finally, we conclude with a discussion in Section 6.

## 2 Random Graphs

In this section, we present a general class of models for random graphs which can be used for creating complex distributions (e.g., distributions that place significant mass on graphs with complicated edge structures). We begin by defining a canonical projection family based on projections that take graphs to their subgraphs. These projections define a consistent family of marginal random (graph) variables, allowing us to specify conditional independence assumptions between them, and in turn, apply Bayesian networks (over the marginal variables that are atomic). Next, we define, using these same graph projections, a Gibbs form for graph distributions, allowing us to specify general factorizations, and in turn, apply Markov random fields. Finally, we consider partially directed models (also known as chain graph models), a generalization of Markov random fields and Bayesian networks; these models are important for random graphs because, as we will discuss, they avoid certain drawbacks that these former models have for this problem, while maintaining their advantages.

### 2.1 Graphs

Suppose we have a vertex space and an edge space , and for simplicity, assume the vertex space is finite. We define a graph to be a couple of the form , where is a set of vertices and is a function assigning an edge value to every pair of vertices:

Hence, every vertex is unique, i.e. no two vertices can share the same value in . We assume the edge space contains a distinguished element that represents the ‘absence’ of an edge (e.g. the value ). If a graph has no vertices, i.e., , we will denote it by and refer to it as the empty graph. For simplicity, we assume there are no self loops. That is, there are no edges between a vertex and itself (i.e., for all ).
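For concreteness, one way to write the definitions of this section is given below. All symbols are assumed here for illustration: $\Lambda_V$ for the vertex space, $\Lambda_E$ for the edge space, $\epsilon_0 \in \Lambda_E$ for the distinguished 'absence' element, and $G_\emptyset$ for the empty graph.

```latex
G = (V, E), \qquad V \subseteq \Lambda_V, \qquad E : V \times V \to \Lambda_E,
\qquad E(v, v) = \epsilon_0 \ \text{ for all } v \in V,
\qquad G_\emptyset = (\emptyset, E_\emptyset).
```

Since $V$ is a set, no two vertices of a graph can share the same value in $\Lambda_V$, and the condition $E(v, v) = \epsilon_0$ expresses the absence of self loops.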

In most real-world applications, graphs have attributes associated with their vertices and edges; in this case, attributes can be incorporated into the vertex space and edge space , or alternatively, graphs can be defined to be attributed. For the presentation of the random graph models, we will proceed in the simpler setting though, deferring attributed graphs and other variations to Section ?. We consider more examples in Section 5.

### 2.2 Marginal Random Graphs

Suppose we have a graph space that we want to define a distribution over. To do this, some basic probabilistic concepts need to be developed; in this section, we define, for a random graph, a family of marginal random (graph) variables. These marginal variables are defined using projections on the graph space, and hence, we require that random graphs take values in graph spaces that are projectable.

Let’s begin by defining an induced subgraph. For a graph , let the subgraph induced by a subset of its vertices be the graph , where is the restriction of ; we let denote the subgraph of induced by . For a given graph, its subgraphs may be thought of as its components or parts, and are fundamental to its statistical modeling.

A graph space is projectable if, for every graph existing in this space, its subgraphs also exist in it. For a graph , let denote the set of all its subgraphs:

where is the set of vertices of graph . This set contains, for example, subgraphs corresponding to individual vertices in (i.e. the subgraphs where and ), and the subgraphs corresponding to pairs of vertices in (i.e. the subgraphs where and ). Now, we may define a projectable graph space:

Henceforth, we assume that every graph space is projectable. Now, we may define graph projections:

The projection maps graphs to their induced subgraphs based on the intersection of their vertices with the vertices . That is, for a graph , if there are no vertices in this intersection (i.e., , where ), then gets projected to the empty graph; if there is an intersection (i.e. , where ), then gets projected to its subgraph induced by the vertices in this intersection.
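The behavior of this projection can be sketched in code. The graph representation (a pair consisting of a vertex set and a dictionary from unordered vertex pairs to edge values, with absent edges simply omitted) and the function name `project` are assumptions of this sketch.

```python
def project(graph, subset):
    """Project a graph onto the vertices in `subset`.

    The result is the subgraph induced by the intersection of the graph's
    vertices with `subset`; if the intersection is empty, the result is
    the empty graph (no vertices, no edges).
    """
    vertices, edges = graph
    kept = frozenset(vertices) & frozenset(subset)
    kept_edges = {pair: value for pair, value in edges.items() if pair <= kept}
    return kept, kept_edges


EMPTY_GRAPH = (frozenset(), {})

# A path 1 - 2 - 3 with binary edge values.
graph = ({1, 2, 3}, {frozenset({1, 2}): 1, frozenset({2, 3}): 1})

sub = project(graph, {2, 3, 4})   # vertex 4 is not in the graph
none = project(graph, {7})        # no overlap with the graph's vertices
```

Here `sub` is the subgraph induced by vertices 2 and 3, and `none` is the empty graph.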

This projection has the property that the image of a projectable graph space is also a projectable space. That is, if the domain is projectable, then for each projection , the codomain is also projectable. This property is useful because it allows us to define a consistent set of marginal random variables. Suppose we have a distribution over a countable graph space ; then the distribution for a marginal random variable taking values in is defined as:

It can be verified that this defines a valid probability distribution, i.e., that

and further, that this set of distributions (i.e., the set ) is consistent, i.e., for all such that , we have that and are consistent:

for all .
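The marginalization and the consistency property can be checked numerically on a small example. In this sketch, a graph is a hashable pair (vertex set, set of binary edges), a distribution is a dictionary from graphs to probabilities, and the names `project` and `marginal` as well as the toy distribution are our own assumptions.

```python
from collections import defaultdict


def project(graph, subset):
    """Induced subgraph on the intersection of the graph's vertices with subset."""
    vertices, edges = graph
    kept = vertices & frozenset(subset)
    return kept, frozenset(e for e in edges if e <= kept)


def marginal(dist, subset):
    """Marginal distribution: sum the mass of all graphs sharing a projection."""
    out = defaultdict(float)
    for graph, p in dist.items():
        out[project(graph, subset)] += p
    return dict(out)


def g(vertices, edges=()):
    """Helper to build a hashable graph."""
    return frozenset(vertices), frozenset(frozenset(e) for e in edges)


# A toy distribution over graphs with vertices drawn from {1, 2, 3}.
dist = {
    g([]): 0.10,
    g([1]): 0.20,
    g([1, 2], [(1, 2)]): 0.30,
    g([2, 3]): 0.15,
    g([1, 2, 3], [(1, 2), (2, 3)]): 0.25,
}

m_small = marginal(dist, {1})
m_nested = marginal(marginal(dist, {1, 2}), {1})
```

Marginalizing directly onto `{1}` and marginalizing in two steps through `{1, 2}` yield the same distribution, which is the consistency property stated above.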

### 2.3 Independence

The marginal random variables for random graphs defined in the previous section allow us to use the standard definitions of independence and conditional independence for random variables. For convenience, we repeat the definition of independence here, using the notation for random graphs. Suppose we have a vertex space and an edge space , and let be a graph space with respect to them. Define independence as follows:

Similarly, conditional independence for random graphs can be defined using the standard definitions as well, which we do not repeat here. These definitions suggest methods for specifying structure in distributions:

### 2.4 Bayesian Networks

In graphical models, graphs are used to represent the structure within distributions; we will refer to these as structure graphs to avoid confusion. For a graph with a binary edge function, two vertices are said to have a directed edge from to (denoted by ) if and , and are said to have an undirected edge between them (denoted ) if . The vertex is a parent of vertex if , and vertices and are neighbors if . The set of parents of is denoted by and the set of neighbors by . In this section, we consider Bayesian networks, a modeling framework based on conditional independence assumptions, specified in structure graphs with directed edges [51].

#### Structure Graphs

Let’s begin by considering Bayesian networks for multivariate random variables; suppose we have a random variable taking values in , and a structure graph with vertices and a binary edge function of the form . Further, assume this structure graph has directed edges and is acyclic; a distribution over is said to factor according to this structure graph if it can be written in the form:

where is the projection of onto its components in the set .
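In standard notation, the Bayesian network factorization for the multivariate case reads as follows. The symbols are assumed here for illustration: $x = (x_1, \dots, x_n)$ for a value of the random variable, $\mathrm{pa}(i)$ for the parents of vertex $i$ in the structure graph, and $x_{\mathrm{pa}(i)}$ for the projection of $x$ onto those components.

```latex
P(x) = \prod_{i=1}^{n} P\bigl(x_i \mid x_{\mathrm{pa}(i)}\bigr)
```

Acyclicity of the structure graph guarantees that the vertices can be ordered so that each conditional in the product depends only on earlier components.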

Now consider Bayesian networks for random graphs; suppose we have a random graph taking values in a graph space , and a structure graph with vertices , where each , and a binary edge function of the form . Further, assume the structure graph is directed and acyclic; a distribution over factorizes according to this structure graph if it can be written in the form:

where, for , we have , and where, recall is the projection of onto the vertices .

#### Atomic Variables

In the previous section, the main difference between the graphical model for multivariate random variables and for random graphs was in the marginal variables used in the structure graph in each case (i.e., the variables in which vertices in the structure graph correspond). In this section, we consider in more detail the subset of variables used, for a given random object, by graphical models in their structure graphs.

Suppose we have a random graph taking values in a projectable graph space . The canonical set of projections on this graph space defines a set of marginal random variables, and a projection in this set such that, loosely speaking, no other projection further projects downward, defines an atomic variable. Informally, a projection is atomic (with respect to a finite projection family) if: (a) there does not exist a projection in this family that projects to a subset of its image; or (b) if there are projections in this family that project to a subset of its image, then this set loses information (i.e., is not a function of these projections). We defer more formal definitions to Section 4.1. The second condition ensures that any object projected by the set of atomic projections can be reconstructed. We will call a marginal variable atomic if it corresponds to an atomic projection.

For random graphs, the atomic projections have the form or (i.e., loosely speaking, the projections to some vertex or edge), and the non-atomic projections have the form where (i.e., the projections to larger vertex sets). Hence, for a random graph , the atomic variables are , where ; these variables can be used as a representation of the random graph, and graphical models specify structure in terms of them (i.e., the vertices in structure graphs correspond to these variables).
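The role of the atomic projections as a representation can be illustrated with a small sketch: projecting a graph onto every single vertex and every pair of vertices, and then reassembling the graph from those pieces. The graph representation (vertex set plus a set of binary edges) and the names `project`, `atomic_projections`, and `reconstruct` are assumptions of this sketch.

```python
import itertools


def project(graph, subset):
    """Induced subgraph on the intersection of the graph's vertices with subset."""
    vertices, edges = graph
    kept = vertices & frozenset(subset)
    return kept, frozenset(e for e in edges if e <= kept)


def atomic_projections(graph, vertex_space):
    """Project onto every singleton and every pair in the vertex space."""
    atoms = {}
    for v in vertex_space:
        atoms[frozenset({v})] = project(graph, {v})
    for u, v in itertools.combinations(sorted(vertex_space), 2):
        atoms[frozenset({u, v})] = project(graph, {u, v})
    return atoms


def reconstruct(atoms):
    """Rebuild a graph as the union of its atomic projections."""
    vertices = frozenset().union(*(g[0] for g in atoms.values()))
    edges = frozenset().union(*(g[1] for g in atoms.values()))
    return vertices, edges


space = {1, 2, 3, 4}
graph = (frozenset({1, 2, 3}), frozenset({frozenset({1, 2}), frozenset({2, 3})}))
atoms = atomic_projections(graph, space)
rebuilt = reconstruct(atoms)
```

In this representation the singleton projections are redundant, since each can be recovered by projecting a pair projection further; for example, `project(atoms[frozenset({1, 2})], {1})` equals `atoms[frozenset({1})]`.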

### 2.5 Gibbs Distribution

In this section, we define a Gibbs form for random graphs based on a canonical factorization; this factorization is determined by the canonical projections, the projection family taking graphs to their subgraphs. For a graph , let denote the set of all subgraphs of of order :

where, recall is the set of all induced subgraphs of , and where denotes the vertices of graph . Hence, the set contains graphs having a single vertex, the set contains graphs having two vertices, and so on.

For this section, let the vertex space be countable, and for any graph, assume its vertex set is finite. We can define a Gibbs distribution for a countable graph space as follows:

A graph space need not be countable (depending on ), but for ease of exposition, we assumed so here. We give some examples in which classic models are expressed in this form.

We now define a positivity condition for distributions; this will allow us to make a statement about the universality of the Gibbs representation.

For a given graph , define

where and where . Using the Möbius formula, we can write

where the positivity condition is required for the validity of the second equation. Note that only depends on (not on the rest of ), so it can be renamed ; letting , we have:

This theorem shows that distributions can be expressed in such a way that the probability of a graph is a function of only its induced subgraphs; that is, statistical models need not include (more formally, set to zero) the value of potentials that involve vertices that are absent from a given input graph. Henceforth, we return to assuming vertex spaces are finite (since, in our formulation, graphical models are limited to finite projection families (see Section 4)).
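The Möbius inversion step can be illustrated numerically. The sketch below assumes a strictly positive weight on every induced subgraph of a fixed graph (so that logarithms are defined), indexes induced subgraphs by their vertex subsets, and uses a toy log-weight of our own choosing; it is an illustration of the inversion formula over the subset lattice, not a reproduction of the proof.

```python
import itertools


def subsets(items):
    """All subsets of a collection, as frozensets."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            yield frozenset(combo)


def mobius_potentials(log_f, vertices):
    """psi(S) = sum over T subset of S of (-1)^(|S| - |T|) * log_f(T)."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * log_f(T) for T in subsets(S))
        for S in subsets(vertices)
    }


# Toy example: a path 1 - 2 - 3; the log-weight of an induced subgraph
# depends only on its number of vertices and edges (hypothetical values).
EDGES = {frozenset({1, 2}), frozenset({2, 3})}


def log_weight(T):
    n_edges = sum(1 for e in EDGES if e <= T)
    return -0.7 * len(T) + 1.3 * n_edges


psi = mobius_potentials(log_weight, {1, 2, 3})
```

Summing `psi` over all vertex subsets recovers `log_weight` of the full graph, and in this example the potential on the three-vertex subgraph vanishes because the toy weight contains only vertex and edge terms.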

### 2.6 Markov Random Fields

In the previous section, we defined a Gibbs distribution for random graphs, a universal representation (Theorem ?) based on a general factorization. In this section, we consider Markov random fields, a graphical model that specifies structure in distributions based on these factorizations ([43], [21], [12]).

Consider Markov random fields for multivariate random variables: suppose we have a random variable taking values in , where each is finite. To define a distribution over , we will assume it equals some product of simpler functions (i.e., functions that have smaller domains than ). To define these simpler functions, we use projections of the form , where and , and take elements in to their components. Using these projections, we can define factors of the form , and a distribution factorizes over if it can be written as:

where , and where denote the power set of . Structure can be specified in this model by the choice of factors. For a given model, complexity can be reduced through the removal of factors (i.e., removing elements from the set ).
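For the multivariate case, the factorization can be sketched by brute-force enumeration over a small space. The sketch assumes binary components, and the names `gibbs_distribution`, `agree`, and the particular factors are hypothetical choices for illustration.

```python
import itertools


def gibbs_distribution(n, factors):
    """Normalized product of factors over n binary components.

    factors: dict mapping a tuple of component indices A to a positive
    function of the corresponding tuple of component values.
    """
    weights = {}
    for x in itertools.product((0, 1), repeat=n):
        w = 1.0
        for A, phi in factors.items():
            w *= phi(tuple(x[i] for i in A))
        weights[x] = w
    z = sum(weights.values())  # normalizing constant
    return {x: w / z for x, w in weights.items()}


def agree(values):
    """A pairwise factor that favors equal values."""
    return 2.0 if values[0] == values[1] else 1.0


# Three components with factors only on the pairs (0, 1) and (1, 2).
dist = gibbs_distribution(3, {(0, 1): agree, (1, 2): agree})
```

Removing a factor from the dictionary yields a simpler model, which is how structure is specified in this representation.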

Now suppose we have a random graph taking values in . As was done in the multivariate case, we define the factorization of distributions over this graph space using a projection family; a distribution can be defined as a product of factors of the form , where, recall is a smaller graph space. A distribution factorizes over if it can be written as:

where , and where we are assuming if . As above, structure can be specified in this model through the choice of factors.

#### Cliques

We now consider the representation of the factorizations in the previous section in terms of an undirected structure graph; suppose we have a neighborhood function that is symmetric, where . In order for a neighborhood function to be valid (i.e., specify independence assumptions that are consistent in the sense that there exists a well-defined distribution that satisfies them), it must specify a direct dependency between any such that one is a subset of the other. That is, for all , we require that

A neighborhood function specifies the set of factors within a model based on its cliques, where cliques are defined as follows:

Hence, by the second condition, we have that each vertex and each pair of vertices are cliques. Let contain the vertex sets that correspond to cliques:

This set represents the set of factors to be used in a distribution (i.e., for each , we will assume there is a factor over this set of vertices). Hence, a Gibbs distribution with respect to a neighborhood function can be defined as follows:

Now that we have defined a Gibbs distribution with respect to a neighborhood function, let’s consider its connections to Markov properties and Markov distributions.

#### Markovity

A distribution is Markov if, loosely speaking, conditional probabilities only depend on local parts of the random object. Let’s consider Markovity for multivariate random variables. Suppose we have a random variable taking values in , and a (symmetric) neighborhood function . A distribution over is Markov with respect to the neighborhood function if, for all and for all , we have that:

where , and where each denotes the component of .

Now consider random graphs; let and be a vertex and edge space, respectively, and let be a graph space with respect to them. Further, let , and suppose we have a (symmetric) neighborhood function . Then, a distribution is Markov with respect to the neighborhood function if, for all and all , we have that:

where , and where . Thus, we define Markovity as follows:

We have that if a distribution is Gibbs with respect to some neighborhood function, then it is Markov with respect to it as well:

The reverse implication in the above proposition is not true (i.e., the Hammersley-Clifford theorem ([30], [3]) does not hold). A neighborhood function can specify more structure for a Markov distribution than for a Gibbs distribution; hence, one cannot specify (general) independence assumptions and then assume a Gibbs form. The reason is that the atomic variables are redundant; a vertex variable is a function of an edge variable of the form . For a discussion of this issue, see Section 2.8. To avoid this drawback, but maintain the advantages offered by undirected models (in particular, the ability to express the probability of a graph in terms of only its subgraphs), we now consider partially directed models.

### 2.7 Partially Directed Models

In this section, we briefly review chain graph models [47], which we will use in the modeling of random graphs. These models involve structure graphs that can have both directed and undirected edges, a generalization of Bayesian models and Markov random fields. Chain graph models are beneficial for random graphs because they allow one to specify, loosely speaking, a Gibbs distribution over vertices, as well as a Gibbs distribution over edges, while avoiding the functional dependencies that are problematic. For these structure graphs, we will assume that all edges between vertex variables and edge variables are directed, and that all other edges are undirected.

In these models, structure graphs are required to be acyclic, where cycles are now defined as follows: a partially directed cycle is a sequence of distinct vertices in a graph, and a vertex , such that:

1. for all , either or , and

2. there exists a such that .

A chain graph is a graph in which there are no partially directed cycles. For a given chain graph, let the chain components be the partition of its vertices such that any two vertices and are in the same partition set if there exists a path between them that contains only undirected edges. In other words, is the partition that corresponds to the connected components of the graph after the directed edges have been removed.
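The chain components described above can be computed directly: discard the directed edges and take the connected components of what remains. The following is a minimal sketch of that step; the function name `chain_components` and the edge-list representation are assumptions made for illustration and are not part of the paper's notation.

```python
from collections import defaultdict

def chain_components(vertices, undirected_edges):
    """Partition vertices into connected components of the undirected part.

    Directed edges are simply not passed in: removing them is exactly
    what the definition of chain components asks for.
    """
    adj = defaultdict(set)
    for u, v in undirected_edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first search that follows undirected edges only.
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(frozenset(comp))
    return components

# Hypothetical structure graph: a - b and c - d are undirected edges;
# directed edges such as a -> c play no role in the partition.
parts = chain_components(["a", "b", "c", "d", "e"], [("a", "b"), ("c", "d")])
```

In this toy example the vertex `e` has no undirected edges, so it forms a chain component by itself.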

A distribution over graph space factorizes according to a chain graph if it can be written in the form:

and further, we have that:

where is the set of cliques in the moralization of the graph , i.e., the undirected graph that results from adding edges between any unconnected vertices in and converting all directed edges into undirected edges, where

The factor normalizes the distribution:

### 2.8 Discussion

We now take a step back and examine some of the design choices made in this section. Graphical models, at a high level, may be thought of as a framework for modeling random objects based on the use of independence assumptions between the parts of the object. It is important that these independence assumptions be made, or can be made, between the smallest parts, those that cannot be decomposed into smaller ones. The reason, as we will discuss in this section, is that this makes the space of (possible) independence assumptions as large as possible, and hence allows the most structure to be specified within a graphical model.

#### Redundant Representations

The representation of a random object based on its atomic marginal variables can have redundancy in it; for example, a vertex variable is a function of an edge variable of the form . This redundancy may appear troublesome since, for example, it means the Hammersley-Clifford theorem cannot be used, preventing us from specifying independencies and then assuming a Gibbs form for distributions. We could remove the redundant variables (i.e., variables that are functions of other variables), and represent the random graph by only the random variables , a subset of the atomic variables. However, this approach is problematic since it also diminishes our ability to specify structure. Representations with redundancy have the advantage, compared to representations without redundancy, of providing a larger space of possible independence assumptions. We illustrate the concept with some examples:

These examples illustrate that representing a random object with atomic variables, even if there is redundancy in them, allows more invariances to be specified by a graphical model than would be possible without all of them. Although having a larger space of independence assumptions is not always beneficial - a practitioner cannot specify invariances between variables so low-level that they are uninterpretable - the specification of invariances involving vertices is natural when modeling random graphs, and so vertex variables should generally be included in any graphical model for this problem.

#### Graph Variations

In this section, we briefly describe other mathematical objects - variations on the definition of a graph - that may be useful for some problems; the graphical model framework discussed in this section can accommodate these objects in a straightforward way.

In the definition of graphs presented in Section 2.1, vertices were a subset of some vertex space , and hence each vertex has a unique value in this space. In some applications, graphs have attributes associated with their vertices, in which case, the vertices need only be unique on some component, for example a location component, and may otherwise have common attribute values. These graphs are referred to as attributed in the literature ([52], [39]). Suppose we have a finite vertex space , an edge space , and an attribute space . We define an attributed graph to be of the form , where is a set of vertices, is a function assigning an attribute value to each vertex, and is a function assigning an edge value to every pair of vertices:

Hence, every vertex in a graph has a unique value in , and the vertices may be thought of as indices for the variables . For example, if we let , then a graph may be thought of as some collection of variables of the form , where , as well as edges between them. The attribute space could be, for example, a finite set of labels or a Euclidean space (for specifying positions).

Graphs may be further generalized to allow higher-order edges; such graphs are referred to as hypergraphs [2]. Suppose we have a finite vertex space and an edge space . Then, we can define a generalized graph to be of the form , where is a set of vertices, and each is a function assigning an edge value to every group of vertices:

Graphs with higher-order edges may be useful in problems in which interactions can be between multiple objects, and these interactions are not a function of the pairwise interactions ([68], [64]).

#### Projections

It is worth noting that if an attributed graph space is constrained to only graphs that: (a) contain the same set of vertices; and (b) have no edges, then the canonical graph projections (Definition ?), in essence, reduce to the component projections used with multivariate random variables. In this sense, the graph projections may be thought of as an extension of the component projections to graph spaces.

## 3 Random Trees

In this section, we consider the statistical modeling of trees; since trees are a type of graph, the random graph models described in Section 2 could be used. However, it is beneficial to instead use models that are tuned to the defining structure of trees. If the vertices in trees are assumed to take a certain form, then the edges in trees are deterministic, given the vertices in it; as a result, the tree space and its modeling are simplified. In particular, with these assumptions about the vertex space, the atomic variables correspond to individual vertices (in contrast to the atomic variables in random graphs). Hence, in basic models, the vertices in structure graphs correspond to the vertices in trees, and in more complex models (e.g., with context-sensitive dependencies), the vertices in structure graphs correspond to the vertices in the vertex space.

We begin by considering two kinds of Bayesian networks: (a) branching models, in which the edges of the structure graph are directed from root to leaves; and (b) merging models, in which the edges are directed the opposite way, from leaves to root. The former is well-suited for problems in which there is a physical process such that, as time progresses, objects divide into more objects; most models in the literature are of this form. The latter, in contrast, is well-suited for problems in which there is some initial set of objects, and as time progresses, these objects merge with each other.

In these types of causal problems, it is generally accepted that the directionality of edges in Bayesian networks should, if possible, correspond to the causality. In some applications, however, trees are not formed by an obvious causal mechanism, and one need not be limited to either a branching or a merging model. For example, consider trees that describe the structure of objects in scenes, where vertices correspond to objects (e.g., cars, trucks, tires, doors, etc.), and edges encode when an object is a subpart of another object ([41], [69]). These trees are representations of scenes, not formed by a clear time-dependent process. Hence, although distributions on these trees can be expressed using branching or merging models, they may not be expressible by them in a compact form, which is essential. In the last part of this section, we consider more general models that may be useful for these problems.

### 3.1 Branching Models

In this section, we consider directed and partially-directed models for random trees in which the directed edges are from root to leaf. We first consider trees without attributes, then proceed to trees with them. To demonstrate the value of the graphical model approach to random trees, we contrast it with approaches based on grammars.

#### Trees

A tree is a graph that is connected and acyclic. A rooted tree is a tree that has a partial ordering (over its vertices) defined by distance from some designated vertex referred to as the root of the tree. Due to the structure of trees, if the vertices in them are given appropriate labels, then the edges are deterministic. For simplicity, let’s consider binary trees; let the vertex space be

where is some natural number. Thus, a vertex has the form , where each and is some arbitrary element that denotes the root vertex (see Figure 9). Let be the projection of a vertex to its first components:

Let a tree be a set of vertices such that, for each vertex in , its ancestors are also in it:

If , we will refer to it as the empty tree. Given a tree , define the parent, children, and siblings of a vertex as:
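To make the labelling concrete, the following minimal sketch represents each vertex of a binary tree as a tuple of 0/1 components, with the empty tuple standing in for the root. This encoding, and the function names, are assumptions made for illustration; the paper's own root symbol and notation were not recoverable here. Because a label records the whole ancestry, the parent, children, and siblings of a vertex can be read off without storing any edges.

```python
def ancestors(v):
    """All proper prefixes of a vertex label, the root () included."""
    return [v[:k] for k in range(len(v))]

def is_tree(vertices):
    """A vertex set is a tree if it contains every ancestor of each member."""
    vs = set(vertices)
    return all(a in vs for v in vs for a in ancestors(v))

def parent(tree, v):
    """Parent of v: its label with the last component dropped (None for the root)."""
    return v[:-1] if v else None

def children(tree, v):
    """Children of v that are actually present in the tree."""
    return {v + (b,) for b in (0, 1)} & set(tree)

def siblings(tree, v):
    """Other children of v's parent that are present in the tree."""
    if not v:
        return set()
    return children(tree, v[:-1]) - {v}

# Example tree: root, both of its children, and one grandchild.
t = {(), (0,), (1,), (0, 1)}
```

Under this encoding the empty set passes `is_tree`, matching the convention that the empty vertex set is the empty tree.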

#### Basic Models

In this section, we consider random tree models over a finite tree space based on marginal random variables that take values in , i.e., are also random trees. In the next section, we expand the set of marginal variables to also include tree parts that do not take values in , but rather in substructures of this space. Let be a set of trees that is projectable, i.e.:

where denotes the set of all subtrees of . We can then define tree projections:

This projection is similar to the one used for (general) graphs, the main difference being that the set of vertices being projected onto cannot be an arbitrary subset of the vertex space, but must correspond to a tree (i.e., ). This ensures that the projection of a tree is always a tree (in a projectable tree space ). We consider projections onto substructures of the tree space in the next section.

Suppose we have a distribution over the tree space . For each , we can define a marginal random (tree) variable taking values in :

For the projection family , the atomic projections correspond to vertex sets that are trees with only one leaf, where a leaf is a vertex with no children (i.e., a vertex such that ); we will refer to a tree with a single leaf as a path-tree. The set of atomic projections corresponds to the set of path-trees because: (1) any tree can be represented by a set of path-trees; and (2) no path-tree can be represented by a set of smaller path-trees. Let denote the set of path-trees:
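The first property, that any tree can be represented by a set of path-trees, can be illustrated with a short sketch. It assumes the same hypothetical encoding of binary-tree vertices as tuples of 0/1 components with `()` as the root; the function names are illustrative only.

```python
def path_tree(v):
    """The path-tree ending at v: v together with all of its ancestors."""
    return frozenset(v[:k] for k in range(len(v) + 1))

def leaves(tree):
    """Vertices of the tree that have no children in the tree."""
    vs = set(tree)
    return {v for v in vs if not any(v + (b,) in vs for b in (0, 1))}

def is_path_tree(tree):
    """True for a (non-empty) tree with exactly one leaf."""
    return len(leaves(tree)) == 1

def path_tree_cover(tree):
    """One path-tree per leaf; their union recovers the original tree."""
    return {path_tree(v) for v in leaves(tree)}

t = {(), (0,), (1,), (0, 1)}
cover = path_tree_cover(t)
```

Here `t` has two leaves, so it is covered by two path-trees, one ending at `(1,)` and one ending at `(0, 1)`.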

To define structure in distributions over the tree space , we can apply a graphical model. We use a Bayesian network here; let be a structure graph, where is an edge function that is asymmetric (where the asymmetry is used in specifying edges that are directed). To define valid distributions, the structure graph must be acyclic and must specify a dependency between any two path-trees in which one is a function of the other; thus, we will assume that there is: (1) a directed edge from every path-tree to its immediate successors (i.e., the path-trees that contain it and have one additional vertex); and (2) no directed edge from a path-tree to any path-tree that is a subtree of it. That is:

1. for all , .

2. for all .

These requirements on the edge function ensure it is consistent with the chain rule:

where denotes the set containing path-trees of cardinality .

#### Substructures

Similar to multivariate random variables, the use of projections onto substructures (Section 4.2) is important when modeling random trees. These additional projections allow one to form additional marginal random variables, which in turn, allow statistical models to specify more structure in distributions.

Let a shifted tree be a pair , where is a vertex and is a set of vertices such that:

1. ;

2. .

In other words, a shifted tree may be thought of as a tree in which serves as the root vertex. For the tree space , let denote the set of shifted trees with root vertex , i.e.:

where denotes the set of vertices that are descendants of , i.e.:

For a vertex , the space is not a subset of the tree space, but rather a substructure of the space . For a given vertex , we define the projection taking trees in to trees in as follows:

For a random tree taking values in a tree space , the substructure projections define marginal random (shifted tree) variables of the form , where and . Each substructure is itself equipped with tree projections. Hence, allowing for both projections to substructures and then projections to trees within this substructure, the set of all projections on is the projection family , and the atomic projections are just the projections onto individual vertices, i.e., those in the set .

To define structure in distributions over the tree space , we can apply a graphical model. Let be a structure graph, where is an edge function. As before, let the structure graph be acyclic, and require that it specify a dependency (either directly or indirectly) between any two vertices in which one is an ancestor of the other.

Similar to general graphs, it will often be useful in the statistical modeling of trees to incorporate invariance assumptions about (shifted) trees that are isomorphic to each other. Recall, two graphs are said to be isomorphic if they share the same edge structure (see Section 5.3). Similarly, two rooted trees are said to be isomorphic if they share the same edge structure, as well as the same partial ordering structure:

#### Attributed Trees

In many real-world problems, the vertices in trees have attributes associated with them. In most of the literature on attributed trees, grammars are used to define the tree space (i.e., the set of trees the grammar can produce). These grammars produce trees by production rules; beginning with the empty tree, larger trees are incrementally built by the iterative application of these rules.

For context-free grammars, distributions can be defined over trees by associating a probability with each production rule. However, it is well-known that this approach (associating probabilities to production rules) does not generalize to the case of context-sensitive grammars (i.e., does not produce well-defined distributions for this grammar). The reason is that, in context-sensitive grammars, the order in which production rules are applied now matters (in determining what trees can be produced), and hence this grammar must have an ordering policy that specifies the next production rule to apply, given the current tree; this policy is a function that generally depends on many of the vertices in the current tree. Hence, to define a distribution over this tree space, the conditional probability of the next tree in a sequence, given the previous one, would not (in general) be conditionally independent of vertices even far removed in tree distance from the vertices being used by the production rule itself. In other words, to make well-defined distributions for a context-sensitive grammar, very high-order models are required.

In this section, rather than trying to define distributions in terms of grammars, we use a graphical model approach; by using the marginal random variables in a random tree, it becomes tractable to specify dependencies and make well-defined tree distributions that are, loosely speaking, context-sensitive. Let an attributed tree be a pair , where is a tree and is a function taking each vertex to some attribute value in an attribute space . For an attribute space , let denote the space of attributed trees:

Since need not be finite, the space may not be finite either (we have only assumed the vertex space is finite, implying a finite number of projections). The definition of a projection on a tree can be extended to a projection on an attributed tree in a straightforward manner: for a tree , let the projection be the intersection of the tree’s vertices with and the restriction of the attribute function to these vertices. We let .

Let an attributed shifted tree be a triple , where is the designated root and , the space of shifted trees with respect to . The definition of isomorphisms for attributed trees is the same as before, except with the additional requirement that the attribute values also match: the trees and are isomorphic to each other if there exists a bijection such that:

1. with respect to .

2. .

Two trees that are isomorphic are denoted by .

### 3.2 Merging Models

In the previous section, we used a vertex space in which the label of each vertex encoded its entire ancestry; hence, if we know a vertex is in a tree, then we also know its ancestors, and this limits one to branching models. In this section, we consider a vertex space in which the label of each vertex instead encodes its descendants, allowing merging models for random trees: beginning with some set of initial objects, trees can be formed by iteratively merging them. Examples include the modeling of cell fusion (i.e., cells that combine) and the modeling of mergers between industrial corporations (which, in the end, form monopolies). We present a simplified version of the vertex space here; it can be extended to more sophisticated forms. As before, due to the structure of trees, if the vertices in them are given appropriate labels, then the edges are deterministic.

Suppose we have some set of vertices such that, for every tree, its leaves are in this set; beginning with some set of vertices , trees will be constructed by merging them. Letting , define the vertex space to be:

Thus, a vertex has the form . As before, we assume binary trees for simplicity; a tree is a set of vertices such that:

1. There exists a vertex such that for all , . This vertex corresponds to the root of the tree.

2. For each vertex , its cardinality is , for some . The value for a vertex corresponds to its level, which we denote by .

3. For each vertex such that , there exists a binary partition of this vertex (i.e., and ), such that and .

An example tree is shown in Figure 10. If , we will refer to it as the empty tree. In this tree definition, a vertex is a leaf if and only if it has cardinality one (i.e., ). Hence, the label of each individual vertex determines whether it is a leaf (unlike in the previous section). For a tree , let denote the set of its vertices that are leaves:

This distinction, in turn, means that for a subset to be a tree (i.e., a subtree of ), its leaves must be a subset of the leaves of (i.e., ). This requirement is in contrast to the previous section, where trees and their subtrees had to have the root vertex in common.
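The three conditions on a merging tree can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: each vertex is a `frozenset` of initial objects (so its label encodes its descendants), and the cardinality requirement in the second condition is read as "a power of two", with the exponent taken as the level. The exact formula is not recoverable from the text here, so this reading, like the function names, should be treated as an assumption. The check implements only the three listed conditions.

```python
from itertools import combinations

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

def level(v):
    """Assumed level of a vertex: k when the vertex has 2**k elements."""
    return len(v).bit_length() - 1

def leaves(tree):
    """A vertex is a leaf exactly when it has cardinality one."""
    return {v for v in tree if len(v) == 1}

def is_merging_tree(tree):
    vs = set(tree)
    if not vs:
        return True  # the empty tree
    # Condition 1: some vertex (the root) contains every other vertex.
    root = max(vs, key=len)
    if any(not v <= root for v in vs):
        return False
    # Condition 2 (assumed form): every cardinality is a power of two.
    if any(not is_power_of_two(len(v)) for v in vs):
        return False
    # Condition 3: every non-leaf vertex splits into two disjoint
    # vertices that are themselves in the tree.
    for v in vs:
        if len(v) > 1 and not any(
            a | b == v and not (a & b) for a, b in combinations(vs, 2)
        ):
            return False
    return True

a, b, c, d = (frozenset(x) for x in "abcd")
t = {a | b | c | d, a | b, c | d, a, b, c, d}
```

Removing the vertex `c | d` from `t` breaks the third condition at the root, since the root can no longer be split into two vertices of the tree.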

Let be a set of trees that is projectable, i.e.: , where denotes the set of all subtrees of . As before, we can then define tree projections:

In the case of the projection family , the atomic projections are not a subset, but rather coincide with the entire projection family. However, assuming projections to substructures as well, as was done in the branching models, we then arrive at the same set of atomic projections, the set of individual vertices .

To define structure in distributions over the tree space , we can apply a graphical model. We use a Bayesian network here; let be a structure graph, where is an edge function that is asymmetric (where the asymmetry is used in specifying edges that are directed). To define valid distributions, the structure graph must be acyclic; for merging models, we assume that edges are in the direction from leaves to root. We must specify a dependency between any two vertices in which one is a function of the other; thus, we assume:

1. for all and .

2. for all .

These requirements on the edge function ensure it is consistent with the chain rule:

where denotes the set of vertices that are on level .

If one assumes that a vertex can only merge with one other vertex on a given layer, then complex dependencies are introduced in which a vertex depends on more than just its children; this situation is similar to that of context-sensitive grammars in branching models, except in the reverse direction. In this case, complex models can result.

### 3.3 General Models

In the previous sections, we used specialized vertex spaces for defining trees; when vertices have labels that specify their sets of possible children or possible parents (and these sets are assumed to be non-overlapping in any valid tree), trees have deterministic edges, given the vertices. However, we could instead define trees in terms of an arbitrary vertex space and then define the tree space by restricting the corresponding graph space to only trees. This has the advantage of allowing one to employ any type of graphical model for random graphs (Section 2). In this more general formulation, distributions need not be defined in terms of how trees are incrementally constructed by a top-down or bottom-up process, but rather how they deconstruct (e.g., into subtrees). This allows, for some problems, a more natural method for defining distributions since it may allow a more compact representation of dependencies.

We will assume the vertex space has some minimal structure, allowing us to define trees based on basic conditions on the vertices and edges. Suppose we have a vertex space of the form

where each space corresponds to the set of vertices that can occur on the th level of the tree (i.e., the distance from a vertex in this set to the root is assumed to be in any tree). Further, we assume for every . For example, for modeling real-world scenes, often one assumes some fixed hierarchy of objects (e.g., cars occur on the th level and car tires occur on the th level). Finally, suppose the edge space is binary.

Let a tree be a graph with respect to this vertex space and edge space (i.e., where is a set of vertices and a binary edge function) such that the following conditions are satisfied: letting , we have that:

1. There is only a single root vertex: if , then .

2. Every (non-root) vertex has one and only one parent: for , for all , we have:

3. There are only edges between adjacent layers: for all such that , we have that for all and .

Let be the space of all such trees. A distribution over this space can be defined using a random graph model; in particular, we may apply an undirected or partially directed model. As mentioned, this additional flexibility may be useful for the modeling of some problems in which there is no obvious causal mechanism.
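The three tree conditions above translate into a simple membership test on graphs. The following is a minimal sketch; the representation (a dictionary from vertex to level, and a set of vertex pairs for the edges with value one), the choice of level 0 for the root level, and the names are all assumptions made for illustration.

```python
def is_level_tree(level_of, edges):
    """Check the single-root, single-parent, and adjacent-layer conditions.

    level_of: dict mapping each vertex in the graph to its level
              (level 0 is assumed to be the root level).
    edges:    set of (u, v) pairs, read as undirected edges.
    """
    if not level_of:
        return not edges

    def linked(u, v):
        return (u, v) in edges or (v, u) in edges

    # Condition 1: exactly one vertex on the root level.
    if sum(1 for lvl in level_of.values() if lvl == 0) != 1:
        return False
    # Condition 3: edges only join vertices on adjacent levels.
    for u, v in edges:
        if abs(level_of[u] - level_of[v]) != 1:
            return False
    # Condition 2: each non-root vertex has exactly one parent,
    # i.e. exactly one neighbour on the level directly above it.
    for v, lvl in level_of.items():
        if lvl > 0:
            parents = [u for u in level_of
                       if level_of[u] == lvl - 1 and linked(u, v)]
            if len(parents) != 1:
                return False
    return True

# Hypothetical scene hierarchy.
levels = {"scene": 0, "car": 1, "truck": 1, "tire": 2}
scene_edges = {("scene", "car"), ("scene", "truck"), ("car", "tire")}
ok = is_level_tree(levels, scene_edges)
```

A graph that passes this test is connected, since every non-root vertex is linked to the root through its chain of parents.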

## 4 General Random Objects

In this section, we consider a general formulation of graphical models on a sample space based on a family of random variables with basic consistency and completeness properties. In the literature, the definition of consistency for random variables is stated in terms of distributions ([11], [44]). In this work, however, we find it convenient to define consistency in terms of the functions themselves (rather than the distributions induced by them). This more elemental definition will be useful in modeling over more general spaces, where to make independence assumptions on distributions, a consistent projection family must first be specified. The projections from this family then define random variables that are consistent (referred to as marginal variables). We begin by considering the case in which projections are from a given sample space to subsets of it; the random graph model discussed in Section 2 uses projections of this form. Then, we consider more general projections, of which the random tree model discussed in Section 3 and the traditional formulation of graphical models for multivariate random variables are instances. For simplicity, we limit the formulation here to finite projection families.

### 4.1 Projection Families

Suppose we have a random object taking values in some space , and suppose we have a family of projections where each projection has the form . Recall, a function is a projection if , i.e., projecting an object more than once does not change its value. In order to produce random variables that are consistent with each other, the projections must be consistent with each other:

In other words, two projections are consistent if: (a) neither image is a subset of the other; or (b) projecting an object onto the smaller space is the same as first projecting the object onto the larger space, and then projecting onto the smaller space. We say that a projection family is consistent if every pair of projections in it is consistent. A consistent family of projections defines a consistent family of random variables (referred to as marginal variables).
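On a finite space, both the projection property and this pairwise notion of consistency can be verified by enumeration. The sketch below is a minimal illustration that follows the verbal description above; the toy space (all subsets of a three-element set, with projections that restrict to a fixed subset) and all function names are assumptions made here, not constructions from the paper.

```python
from itertools import combinations

def powerset(items):
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def image(proj, space):
    return {proj(x) for x in space}

def is_projection(proj, space):
    """Idempotence: projecting twice is the same as projecting once."""
    return all(proj(proj(x)) == proj(x) for x in space)

def are_consistent(p, q, space):
    img_p, img_q = image(p, space), image(q, space)
    if img_p <= img_q:
        small, large = p, q
    elif img_q <= img_p:
        small, large = q, p
    else:
        return True  # case (a): neither image contains the other
    # Case (b): projecting directly equals projecting via the larger space.
    return all(small(x) == small(large(x)) for x in space)

def restrict_to(a):
    """Projection that keeps only the elements lying in a."""
    a = frozenset(a)
    return lambda x: x & a

space = powerset({1, 2, 3})
p1, p12, p23 = restrict_to({1}), restrict_to({1, 2}), restrict_to({2, 3})

# An idempotent map that is NOT consistent with p12: it looks at
# element 3, which p12 has already discarded.
def q(x):
    return frozenset({1}) if (1 in x or 3 in x) else frozenset()
```

The map `q` shows why idempotence alone is not enough: it is a projection whose image sits inside that of `p12`, yet it gives a different answer depending on whether `p12` is applied first.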

Although this definition of consistent projections corresponds to the definition of consistent random variables, it will be useful when formulating graphical models to assume a stronger form:

As before, we say that a projection family is strongly consistent if every pair of projections in it is strongly consistent. The canonical projection family for random graphs (Section 2.2) is strongly consistent, and the canonical projection family for random vectors (i.e., the coordinate projections) is strongly consistent as well (see next section). We illustrate the importance of strong consistency in modeling with an example:

This example motivates formulating graphical models in terms of strongly consistent projections. It will be convenient to index projection families so as to indicate which projections’ images are subsets of others’. This can be done as follows. For a finite consistent projection family , there exists some finite set and , such that we can write:

and where:

1. .

2. .

3. .

Further, we will assume that is a minimal set for indexing in this way (in the sense that there does not exist such that and the above holds). Thus, the indices in an index set show when the images of projections intersect or are subsets. Henceforth, we assume projection families are indexed in this way.

We now define a completeness condition for a projection family, which will also be useful in modeling:

In other words, for any two projections and in , a projection of the form also exists in it. Notice that if a projection family is consistent and complete, then it is also strongly consistent. Conversely, if a projection family is strongly consistent, then it can be made complete by augmenting it with additional projections. For modeling purposes, the value of completing a projection family in this sense is that it provides a larger space of possible independence assumptions. Since the traditional formulation of graphical models is in terms of a consistent, complete system of projections, we will define the extended formulation likewise. We now define the notion of atomic projections.

Loosely speaking, for a projection family , a projection in it is atomic if: (1) there does not exist a projection in this family that projects to a subset of its image; or (2) if there are projections in this family that project to a subset of its image, then this set loses information. The second condition ensures that any object projected by a set of atomic projections can be reconstructed. To define this more formally, we introduce some notation. For a projection family indexed by , let denote the subset of projections indexed by , i.e.:

We say that a set of projections is invertible over a set if there exists a function such that

We define atomic projections as follows:

In other words, for a projection family, the atomic projections are those with the smallest images or, if there are projections with smaller images, those whose values cannot be reconstructed from these smaller projections. We will call a random variable atomic if it corresponds to an atomic projection. Finally, to be used in modeling, we need to assume that a projection family has enough coverage over a space so that it can be used for representing objects in it:

If a finite projection family over contains the identity projection , then there exists an atomic representation of within . If a finite projection family over contains the identity projection and is consistent and complete, then it has a unique atomic representation. For defining a graphical model over with respect to a projection family , we will let its structure graph correspond to an atomic representation of within (i.e., the vertices in the structure graph will correspond to the projections in the atomic representation); if the atomic representation is unique, then so will be the vertex set in the structure graph. We can now express, for graphical models, the requirements on projection families.

Suppose we have a consistent, complete system of projections over an object space . Further, assume is finite, non-empty, and contains the identity projection . With only these assumptions on the projection family, we can model distributions over using independence and factorization, the invariances used in graphical models. The projections, since they are consistent, define a set of marginal random variables, as well as a unique set of atomic random variables, and so we can encode independence assumptions in a compact form using them. A projection family on the space also gives rise to a Gibbs representation for distributions over it:

where , and where , facilitating factorization and the use of undirected models. If there are functional dependencies in the atomic projections, then the objects in have structural constraints, and the structure graph must respect them. If there are no functional dependencies, then the Hammersley-Clifford theorem may be directly applied; otherwise, a partially directed network may be necessary.

The formulation of graphical models given here encompasses the random graph models from Section 2. However, notice that in graphical models for multivariate random variables, the projections are not to subsets of the sample space, but rather to substructures. We now turn to this topic, and extend the formulation to include these more general projections.

### 4.2 Substructures

In the previous section, we discussed projection families in which each projection’s image was a subset of its domain (i.e., for each projection , we have ). In this section, we consider functions on whose images need not be subsets of it (i.e., ); for convenience, we will view these functions as mapping into substructures, and refer to them as projections. This extension is important because it allows projection families to be larger, which, in turn, allows additional structure to be incorporated within models. An example of a projection from a space to a substructure is the projection of a vector to one of its components. For simplicity, the development here is not given in terms of substructures; the ideas can be stated in terms of the projections, without making more explicit assumptions about the structure of the space .

Suppose we want to define a distribution over a space using graphical models, and further, suppose we have a set of projections of the form ; since the images of these projections are not necessarily subsets of , the composition of these projections is no longer well-defined (i.e., the image of one projection is not necessarily a subset of the domain of another projection). In order to define a notion of consistency for this projection family, there must exist projections between these spaces. For a projection family

where is an index set, suppose that the following conditions hold:

1. (Completeness) For all such that , we have that .

2. (Consistency) For all such that , there exists a projection of the form

and this projection is defined by

If these conditions hold, then we say the projection family is consistent and complete (this is a natural extension of the definitions in the previous section). Incorporating these projections into the above projection family, we have the following set of projections:

where , and .

Another example is the projection to substructures used in modeling random trees (Section ?). For a sample space with a distribution over it, if we have a finite, consistent, complete system of substructure projections on , then we can define a marginal random variable for each index in this family, and similarly, we may also define a Gibbs form (Equation 5). Hence, we have arrived at a general framework, based on general projections.
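The vector example mentioned at the start of this section can be sketched in code. The following is a minimal illustration, not the general construction: it assumes the object space is the set of binary vectors of length 3 and that the projection family is indexed by non-empty subsets of coordinates. The names `project`, `project_between`, `is_consistent`, and `marginal` are hypothetical helpers introduced here. The check confirms that projecting to a coordinate subset and then to a smaller subset agrees with projecting directly, which is the kind of agreement the consistency condition asks for.

```python
from itertools import combinations, product

def project(x, S):
    """Project a vector x (a tuple) onto the coordinates listed in S."""
    return tuple(x[i] for i in S)

def project_between(y, S, T):
    """Map a point y of the S-substructure to the T-substructure (T a subset of S)."""
    position = {i: k for k, i in enumerate(S)}
    return tuple(y[position[i]] for i in T)

def is_consistent(space, index_sets):
    """Check that projecting via S and then to T equals projecting directly to T."""
    for S in index_sets:
        for T in index_sets:
            if set(T) <= set(S):
                for x in space:
                    if project_between(project(x, S), S, T) != project(x, T):
                        return False
    return True

def marginal(dist, S):
    """Marginal distribution induced by the projection onto coordinates S."""
    out = {}
    for x, p in dist.items():
        y = project(x, S)
        out[y] = out.get(y, 0.0) + p
    return out

n = 3
space = list(product([0, 1], repeat=n))  # assumed object space: binary vectors
index_sets = [S for r in range(1, n + 1) for S in combinations(range(n), r)]
consistent = is_consistent(space, index_sets)

uniform = {x: 1.0 / len(space) for x in space}
first_coordinate = marginal(uniform, (0,))
```

Each index set in this sketch yields one marginal variable, which mirrors how a projection family defines marginal random variables on a general object space.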

### 4.3 Compositional Systems

A projection family on an object may be viewed as defining a compositional system. Compositionality refers to the phenomenon in which objects are composed of parts, which are themselves composed of parts, and so on, and in which the same part can occur in multiple larger parts. For a given set of objects , a projection family on it defines the decomposition of objects into a hierarchy of parts, and this may be viewed as a top-down approach to defining a compositional system. This approach differs from that taken in [22]; in that work, given a set of primitive parts , a set of composition rules is used to define the allowable groupings of parts into larger parts, which may be viewed as a bottom-up approach to defining compositional systems. The perspective offered here on these systems is very different from that taken in the literature; our intention is only to provide context for how graphical models, as formulated above, fit among other general frameworks, and this is only one possible interpretation of the relationship.

To illustrate modeling a compositional system, consider character recognition, a classic problem in the field of computer vision. The goal is to design a computer vision system that takes images of handwritten characters and determines the character being displayed. Let the object space be the space of possible binary (i.e., black and white) images with a label (i.e., we assume every image has a label attached to it from some label set). Following along the lines of the example given in [22], the most primitive parts may be images with only a single black point, having the label ‘point’; the next simplest parts might be images with only two points within close proximity, having the label ‘linelet’; these objects can be combined to form objects with the label ‘line’, which in turn can be combined to form objects with the label ‘L-junction’, and so on, until finally objects with the label ‘character’ are formed.

If instead of defining a part as a single element, we let it be a random variable that takes values in a subset of this space, then we may associate them to projections (taking images and their labels to a subset of the images and their labels). These projections are consistent with each other, and so define a consistent set of marginal distributions over these parts. In turn, these marginal variables allow a graphical model approach to be applied to the problem, allowing the efficient estimation of distributions over the object space .

We note that in the approach to compositional systems based on projection families, the composition rules (describing how to combine parts into larger parts) are provided, in a sense, by the (marginal) probabilities of parts having some value, conditioned on the value of its constituent parts (the value of its projections). When this probability is nonzero, one may interpret that a composition rule is dictating that these constituent parts are combinable.

## 5 Examples

In this section, we consider the practical application of the models described in the previous sections. Since the random graph models are more general than the random tree ones, and because they differ more from the models in the literature, we focus our attention on them here. We will use factorization to specify structure in distributions (which, for graphs, differs from specifying independence assumptions, see Section 2.6); the reason is, for the examples considered here, this invariance is more straightforward to specify and operate on. We also discuss invariances on distributions based on graph isomorphisms, an assumption used in many random graph models. The use of these invariances on unattributed graphs, however, causes models to be susceptible to degeneracy problems. To avoid this issue, it is important for models that employ these invariances to assign latent variables to the vertices, or equivalently, to use attributed graphs. We will assume models that take a simple exponential form based on the use of template graphs. We illustrate the ideas with several examples.

### 5.1 Compact Distributions

Although a distribution over a finite graph space can always be specified by directly assigning a probability to each graph in it, in practice we need to make assumptions about the distribution. In Section 2.6, we discussed Gibbs form and the specification of structure based on factorization, where a Gibbs distribution has the form:

In the examples considered here, we find it natural to allow slightly more structure than can be obtained only through the specification of factors; we also want to be able to assign individual graphs to have a factor value of zero. In other words, we are interested in defining structure through the specification of a small subset such that by assigning a potential value to each graph in , the probability of every graph in can be determined. Hence, given a basis, we assume the potential of any graph is zero, and define the probability of a graph as

where and .

The model given in Equation 7 can be further simplified by assuming the function has some structure. This can be done in many ways; the simplest is to assign the same function value to graphs that are similar in some sense. For example, we might want graphs that are isomorphic to each other to have equal values (i.e., setting for all that are isomorphic). More generally, we can specify structure in by assuming an additive relationship of the form:

where each is a subset of the basis and each a real number. Then the model in Equation 7 simplifies to:

where is the number of subgraphs of type in the graph . We will find it convenient to reformulate each set as a binary function: define a function such that

Then, equivalently, we have that . We refer to the binary functions as compatibility maps. We now consider methods for specifying these maps.
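The counting model described above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the text: subgraphs are taken to be induced subgraphs on vertex subsets of bounded order, the model is assumed to have the exponential form in which the unnormalized weight of a graph is the exponential of a weighted sum of counts, and the two compatibility maps (`is_edge`, `is_triangle`) and their weights are hypothetical choices made only for the example.

```python
import math
from itertools import combinations

def induced_subgraphs(vertices, edges, max_order):
    """Yield (vertex set, edge set) for every induced subgraph up to max_order vertices."""
    for k in range(1, max_order + 1):
        for vs in combinations(sorted(vertices), k):
            vset = set(vs)
            es = frozenset(e for e in edges if e <= vset)
            yield frozenset(vs), es

# Hypothetical compatibility maps: binary functions on subgraphs.
def is_edge(vs, es):
    return len(vs) == 2 and len(es) == 1

def is_triangle(vs, es):
    return len(vs) == 3 and len(es) == 3

def counts(vertices, edges, maps, max_order):
    """Number of subgraphs accepted by each compatibility map."""
    return [
        sum(1 for vs, es in induced_subgraphs(vertices, edges, max_order) if m(vs, es))
        for m in maps
    ]

def unnormalized_weight(vertices, edges, maps, weights, max_order):
    """exp(sum_i weight_i * count_i); the normalizing constant is not computed."""
    c = counts(vertices, edges, maps, max_order)
    return math.exp(sum(w * n for w, n in zip(weights, c)))

# A triangle on {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
vertices = {0, 1, 2, 3}
edges = {frozenset(p) for p in [(0, 1), (1, 2), (0, 2), (2, 3)]}
maps = [is_edge, is_triangle]
weights = [-0.5, 1.0]  # assumed parameter values

subgraph_counts = counts(vertices, edges, maps, max_order=3)
weight = unnormalized_weight(vertices, edges, maps, weights, max_order=3)
```

Dividing `weight` by the sum of the weights of all graphs in the graph space would give the probability; that sum is generally expensive, which is one reason to keep the set of templates small.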

### 5.3 Graph Isomorphisms

An important way to compare two graphs is based on how their parts compare. In this section, we consider isomorphisms, a comparison method based on second-order subgraphs; two graphs are said to be isomorphic if they share the same edge structure:

A distribution over a graph space is said to be invariant to isomorphisms if any two graphs that are isomorphic have the same probability, i.e.:

where . We now consider some isomorphism variations that will be useful for attributed graphs.
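For small graphs, whether two graphs share the same edge structure can be checked by brute force. The sketch below assumes simple unattributed graphs with binary edges and no self-loops, and uses the standard notion of isomorphism (a bijection between vertex sets that maps edges to edges); the function name `are_isomorphic` is introduced here for illustration. It tries every bijection, so it is only practical for the low-order subgraphs used as templates.

```python
from itertools import permutations

def are_isomorphic(vertices1, edges1, vertices2, edges2):
    """Brute-force test for an edge-preserving bijection between two simple graphs."""
    v1, v2 = sorted(vertices1), sorted(vertices2)
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    if len(v1) != len(v2) or len(e1) != len(e2):
        return False
    for perm in permutations(v2):
        f = dict(zip(v1, perm))
        # Edge counts are equal and f is injective, so mapping every edge of
        # graph 1 onto an edge of graph 2 implies the edge sets correspond exactly.
        if all(frozenset(f[u] for u in e) in e2 for e in e1):
            return True
    return False

path_a = ([0, 1, 2], [(0, 1), (1, 2)])
path_b = ([10, 20, 30], [(10, 30), (30, 20)])
triangle = ([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
path4 = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
star4 = ([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])

same_paths = are_isomorphic(*path_a, *path_b)
path_vs_triangle = are_isomorphic(*path_a, *triangle)
path_vs_star = are_isomorphic(*path4, *star4)
```

A distribution that is invariant to isomorphisms would assign `path_a` and `path_b` the same probability.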

### 5.4 Attributed Graph Isomorphisms

When modeling unattributed graphs, it is often important to associate attributes with the vertices in these graphs. The attributes, in this case, may be thought of as latent variables, which can reduce the order of models. Suppose we have a finite vertex space , an edge space , and an attribute space . Recall from Section ?, an attributed graph has the form , where:

The simplest isomorphism for attributed graphs is based on the edge structure and attributes on individual vertices:

This definition is a natural extension of Definition ? to attributed graphs. As an example, suppose the attribute space is some finite set of labels or colors; then for graphs to be isomorphic by this definition, the coloring of vertices must be respected in addition to the edge structure. The next simplest isomorphism for attributed graphs is based on the attributes on pairs of vertices. Suppose we have a distance function over the attribute space .

This second-order isomorphism is used in many latent position models ([36]), where is a Euclidean space; for models using this isomorphism invariance, the probability of a graph depends on the distances between vertices in it, not on their particular locations. These definitions can be extended to higher orders in a straightforward manner. To summarize, we have presented some isomorphisms that can be used to specify when graphs are similar to each other. We will make use of them to specify compatibility maps in the examples presented later in this section.
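The difference between the two attributed isomorphisms can be illustrated with a brute-force check. This sketch rests on assumptions: graphs are stored as dictionaries with vertex attributes and a set of binary edges, the first-order test requires a bijection that preserves edges and individual vertex attributes, and the second-order test requires a bijection that preserves edges and pairwise attribute distances. The function names and the tolerance `tol` are hypothetical.

```python
import math
from itertools import combinations, permutations

def exists_isomorphism(g1, g2, preserves):
    """Search for an edge-preserving bijection that also satisfies `preserves`."""
    v1, v2 = sorted(g1["attrs"]), sorted(g2["attrs"])
    if len(v1) != len(v2) or len(g1["edges"]) != len(g2["edges"]):
        return False
    for perm in permutations(v2):
        f = dict(zip(v1, perm))
        edges_ok = all(frozenset(f[u] for u in e) in g2["edges"] for e in g1["edges"])
        if edges_ok and preserves(f):
            return True
    return False

def first_order_isomorphic(g1, g2):
    """Edge structure and the attribute of each individual vertex must match."""
    a1, a2 = g1["attrs"], g2["attrs"]
    return exists_isomorphism(g1, g2, lambda f: all(a1[v] == a2[f[v]] for v in a1))

def second_order_isomorphic(g1, g2, dist, tol=1e-9):
    """Edge structure and distances between attributes of vertex pairs must match."""
    a1, a2 = g1["attrs"], g2["attrs"]
    def preserves(f):
        return all(
            abs(dist(a1[u], a1[v]) - dist(a2[f[u]], a2[f[v]])) <= tol
            for u, v in combinations(sorted(a1), 2)
        )
    return exists_isomorphism(g1, g2, preserves)

# Two single-edge graphs whose vertex positions differ by a translation.
g1 = {"attrs": {0: (0.0, 0.0), 1: (1.0, 0.0)}, "edges": {frozenset((0, 1))}}
g2 = {"attrs": {0: (5.0, 5.0), 1: (6.0, 5.0)}, "edges": {frozenset((0, 1))}}

first = first_order_isomorphic(g1, g2)
second = second_order_isomorphic(g1, g2, math.dist)
```

The translated graph fails the first-order test because the positions themselves differ, but passes the second-order test because only the distance between the two vertices matters.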

### 5.5 Master Interaction Function

In defining distributions over graph spaces, it will often be useful to reduce the size of the graph space by removing graphs that have zero probability. One way to do this, assuming that the edge space has a partial ordering , is to define a function that restricts the edge configurations allowed in graphs:

We use master interaction functions to restrict graph spaces to only those graphs that respect them. That is, for a graph space and some master interaction functions and , we can restrict the graphs to the set:

### 5.6 Examples

In this section, we illustrate the above ideas with some examples. In each example, the model takes the form of Equation 8, and uses some set of templates . For each template , the compatibility map is based on whether a graph is isomorphic to it, i.e.:

In all the examples except the first one, we assume the isomorphism used is the first-order isomorphism. The sampling and learning algorithms are discussed in Section 5.7.

#### Example 1: Grid Graphs

We consider unattributed grid-like graphs such as the one shown in Figure 11. Let the vertex space be a grid of size , and let , specifying the absence of an edge or the presence of an edge, respectively. We can specify the master interaction function to take pairs of vertices that cannot have an edge to the value , and pairs that can have an edge to the value 1. Define as follows:

where and . Hence, this master interaction function ensures the graph space only contains grid-like graphs.
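The effect of such a restriction can be seen on a very small case. The sketch below assumes a 2 × 2 grid whose vertices are all present, binary edge values ordered as 0 ≤ 1, and a master interaction function that allows an edge only between horizontally or vertically adjacent grid positions; this adjacency rule and the names `grid_master`, `respects`, and `restricted_graph_space` are assumptions made for illustration.

```python
from itertools import combinations, product

def grid_master(u, v):
    """Return 1 if grid positions u and v are horizontal or vertical neighbours, else 0."""
    (r1, c1), (r2, c2) = u, v
    return 1 if abs(r1 - r2) + abs(c1 - c2) == 1 else 0

def respects(edge_values, master):
    """A graph respects the master function if no edge value exceeds the allowed one."""
    return all(value <= master(u, v) for (u, v), value in edge_values.items())

def restricted_graph_space(vertices, master):
    """Enumerate all binary edge assignments that respect the master function."""
    pairs = list(combinations(vertices, 2))
    for values in product([0, 1], repeat=len(pairs)):
        graph = dict(zip(pairs, values))
        if respects(graph, master):
            yield graph

vertices = list(product(range(2), range(2)))  # positions of a 2 x 2 grid
n_pairs = len(list(combinations(vertices, 2)))
unrestricted_size = 2 ** n_pairs
space = list(restricted_graph_space(vertices, grid_master))
```

Here the restriction removes the two diagonal pairs, so the graph space shrinks from 64 graphs to 16; on larger grids the reduction is far greater.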

A possible set of templates is shown in Table 1. Each template in this table specifies a compatibility map based on graphs that are isomorphic to . Here, we made the following design choices. First, we have limited the order of the template graphs to fourth order and lower (i.e., graphs such that ). Second, to make computation feasible, we apply a ‘locality’ principle in which only connected graphs are used as templates. Since unconnected graphs constitute the vast majority of the subgraphs in for any given graph , restricting the templates to connected graphs is necessary for computational reasons. For example, consider the second-order subgraphs in the graph in Figure 11; there are subgraphs of this order, but only of them are connected. If we consider higher-order subgraphs, this gap widens.
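The gap between connected and unconnected subgraphs can be checked numerically. Since the graph in Figure 11 is not reproduced here, the sketch below uses a hypothetical stand-in: a 3 × 3 grid in which every pair of adjacent positions is joined by an edge. Subgraphs are taken to be induced subgraphs on vertex subsets, which is an assumption of this illustration.

```python
from itertools import combinations, product

def grid_edges(rows, cols):
    """Edges of a full grid graph: all horizontally or vertically adjacent pairs."""
    edges = set()
    for r, c in product(range(rows), range(cols)):
        if r + 1 < rows:
            edges.add(frozenset(((r, c), (r + 1, c))))
        if c + 1 < cols:
            edges.add(frozenset(((r, c), (r, c + 1))))
    return edges

def is_connected(vertex_subset, edges):
    """Depth-first search restricted to the given vertex subset."""
    vs = set(vertex_subset)
    start = next(iter(vs))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in vs:
            if w not in seen and frozenset((u, w)) in edges:
                seen.add(w)
                stack.append(w)
    return seen == vs

def count_subgraphs(vertices, edges, order):
    """Return (total, connected) counts of induced subgraphs with `order` vertices."""
    subsets = list(combinations(vertices, order))
    connected = sum(1 for s in subsets if is_connected(s, edges))
    return len(subsets), connected

vertices = list(product(range(3), range(3)))
edges = grid_edges(3, 3)
second_order = count_subgraphs(vertices, edges, 2)
third_order = count_subgraphs(vertices, edges, 3)
```

For this stand-in graph, 12 of the 36 second-order subgraphs are connected, and 22 of the 84 third-order subgraphs are connected, so the connected fraction falls as the order grows.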

Given these templates, the number of subgraphs that correspond to a given pattern can be calculated for any graph , and hence its probability can be calculated. For example, for the graph in Figure 11, the probability is expressed as follows:

#### Example 2: ‘Molecule’ Graphs

We consider an example in which the graph space is composed of graphs that loosely resemble molecules in appearance. An example is shown in Figure 12.

In this example, we will use attributed graphs of the form . Let be the vertex space, the attribute space, where each represents a color, and the edge space. We can specify the master interaction function to specify that vertices with the same color cannot have an edge between them (e.g., set if ). Similarly, we might want to specify that vertices with certain different colors can have an edge between them (e.g., set for some ).
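A color-based rule of this kind can be sketched as follows. The color names, the set of allowed color pairs, and the helper names `make_color_master` and `respects` are hypothetical; the sketch assumes binary edges, with value 0 meaning an edge is forbidden and value 1 meaning it is allowed.

```python
def make_color_master(allowed_color_pairs):
    """Build a master interaction function on pairs of vertex colors."""
    allowed = {frozenset(pair) for pair in allowed_color_pairs}

    def master(color_u, color_v):
        if color_u == color_v:
            return 0  # same-colored vertices may not share an edge
        return 1 if frozenset((color_u, color_v)) in allowed else 0

    return master

def respects(colors, edges, master):
    """Check that every edge in the graph joins an allowed pair of colors."""
    return all(master(colors[u], colors[v]) == 1 for u, v in edges)

master = make_color_master([("red", "blue")])  # assumed allowed pair
colors = {0: "red", 1: "blue", 2: "red", 3: "green"}

chain_ok = respects(colors, [(0, 1), (1, 2)], master)
same_color_edge = respects(colors, [(0, 2)], master)
unlisted_pair_edge = respects(colors, [(0, 3)], master)
```

Only graphs for which `respects` returns `True` would remain in the restricted graph space.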

A possible set of templates and their corresponding parameters is shown in Table 2. For each template graph , we define a compatibility map based on graphs that are second-order isomorphic to it (Equation 9). Given these templates, the number of subgraphs that correspond to a given pattern can be calculated for any graph , and hence its (unnormalized) probability can be calculated. For the graph in Figure 12, the probability can be expressed as follows:

Notice that in this example, the attributes (i.e., the colors associated with vertices) allow distributions in which, loosely speaking, typical samples have complex structure even though the basis does not contain high-order graphs. For example, the edge structure in these graphs is very unlikely to have been generated by independent coin flips as in an Erdős-Rényi model. If the vertices did not have these attributes and we wanted to define a distribution that has the same probabilities as in this example (e.g., one that assigns the same probability to the unattributed version of the graph in Figure 12), it would be necessary for any basis to contain graphs of much higher orders than those in the basis used in this example. Hence, we see that attributes are important latent variables even if one only wants to define distributions over unattributed graph spaces. Thus, ideas contained in latent position models ([36]) and latent stochastic blockmodels ([1], [45]) can be incorporated within the framework here.