A Unifying Generative Model for Graph Learning Algorithms: Label Propagation, Graph Convolutions, and Combinations

A Unifying Generative Model for Graph Learning Algorithms: Label Propagation, Graph Convolutions, and Combinations

Abstract

Semi-supervised learning on graphs is a widely applicable problem in network science and machine learning. Two standard algorithms — label propagation and graph neural networks — both operate by repeatedly passing information along edges, the former by passing labels and the latter by passing node features, modulated by neural networks. These two types of algorithms have largely developed separately, and there is little understanding about the structure of network data that would make one of these approaches work particularly well compared to the other or when the approaches can be meaningfully combined.

Here, we develop a Markov random field model for the data generation process of node attributes, based on correlations of attributes on and between vertices, that motivates and unifies these algorithmic approaches.1 We show that label propagation, a linearized graph convolutional network, and their combination can all be derived as conditional expectations under our model, when conditioning on different attributes. In addition, the data model highlights deficiencies in existing graph neural networks (while producing new algorithmic solutions), serves as a rigorous statistical framework for understanding graph learning issues such as over-smoothing, creates a testbed for evaluating inductive learning performance, and provides a way to sample graphs attributes that resemble empirical data. We also find that a new algorithm derived from our data generation model, which we call a Linear Graph Convolution, performs extremely well in practice on empirical data, and provide theoretical justification for why this is the case.

1 Label propagation, graph neural networks, and estimating attributes on nodes in graphs

Graphs, which consist of a set of nodes along with a set of edges that each connect two nodes, are natural abstractions of relational data and systems with interacting components. A common graph learning problem, often called graph-based semi-supervised learning, is to predict the labels or outcomes on a subset of nodes, given observations on the others [73, 17, 68, 5, 10, 20, 31, 29, 22, 52]. For example, a social networks company could use the age of users to better serve contents or sell ads, but the age of a subset of users might not be listed and therefore need to be estimated. Similarly, a city planner could try to predict traffic on streets in a network of roads given sensor readings on a subset of streets, or a political consultant might want to forecast election outcomes in different geographically-connected regions (e.g., the counties of a state within the U.S.) but only have polling data in certain locations.

One approach to these problems is based on the fact that two nodes connected by an edge are often similar, a principle known as homophily in the case of social networks [44] or assortativity more generally [47]. In the above examples, two friends on an online social network are more likely to be of a similar age [64], many cars on one street increases traffic on connected streets, and regional voting patterns are often positively correlated spatially [14]. If we only have outcomes and no other information, label propagation (LP) algorithms constitute a standard class of methods for making predictions at unlabeled nodes [73, 20, 72]. These algorithms find a label assignment for all nodes that (i) agrees with the known labels and (ii) varies smoothly over the graph. These algorithms can be formalized as an optimization problem of the form

(1)

where is the estimated outcome on each node, is the original outcome at if is labeled and 0 otherwise, is a constant, and is a graph-based smoothness penalty such as for the combinatorial graph Laplacian (see the the survey by Zhu for several examples [74, Section 6]). The names label propagation, label spreading, or diffusion for graph-based semi-supervised learning stem from algorithms that optimize objectives like those in Eq. 1, which can be viewed as procedures that propagate, spread, or diffuse outcomes on labeled nodes to the unlabeled nodes.

Of course, we often have additional information or features on nodes that are useful for estimating outcomes. For instance, the types of content consumed by a user in a social network would correlate with age, the weather affects traffic in a road network, and unemployment levels might matter for elections. Graph neural networks (GNNs) are a popular class of algorithms that use such information [57, 23]. In general, these algorithms solve an optimization problem of the form

(2)

where denotes the -hop neighborhood of vertex , is the set of labeled nodes, denotes the set of features of node , is a function that usually involves neural networks, and is the estimated outcome at node . The squared loss in Eq. 2 is based on a regression setup, which we focus on in this paper, although these methods are often used with a cross-entropy loss for classification problems.

The structure of Eqs. 2 and 1 are markedly different. At first glance, one might wonder why label propagation methods do not use node features. The reason is rooted in the original motivation for many of these algorithms — semi-supervised learning for point cloud data. In this setting, one typically assumes that (i) only a limited number of data points are labeled, (ii) the labels vary smoothly over a low-dimensional manifold, and (iii) the data points on which we would like to make predictions are known ahead of time [74]. The edges in the graph are then constructed from the features themselves. More specifically, an edge encodes the similarity of two points in feature space and the smoothness regularizer captures assumptions (ii) and (iii). That being said, label propagation methods have still proven tremendously valuable in settings where the edges of a given graph serve as a proxy for similarity [42, 20, 1, 13, 27, 24, 62], although one has to be careful about making this assumption in some settings [52, 9]. Furthermore, early semi-supervised learning algorithms discussed jointly using “external classifiers” with label propagation [73, Section 5] and collective classification methods combine both network structure and node features [58] (also see the related work below).

One might also wonder why graph neural networks do not explicit use the fact that connected nodes tend to share the same outcome or label. As an extreme case, suppose the outcomes varies smoothly along edges in a graph where all nodes have the same features. Then the GNN model of Eq. 2 is meaningless, but the label propagation model in Eq. 1 is still useful. A more realistic setting is one in which the features are mildly but not overwhelmingly predictive of the outcomes, in which case some combination of Eqs. 2 and 1 would seem reasonable. The machine learning community has recently developed several heuristics that (at least implicitly) address this shortcoming:

  • augment the features of a node with the output of a label propagation algorithm before using a GNN [60];

  • use label propagation algorithms to enforce that some intermediate quantity used in computing the function in Eq. 2 varies smoothly over the graph [30, 6];

  • use the output of a label propagation algorithm as a pre-processing step so that a GNN places more emphasis on certain connections [66];

  • use a label propagation as a post-processing step to enforce smoothness of the model estimates over the graph [24, 26];

  • use several functions as in Eq. 2, associating each with different parts of the graph where different types of outcomes may be more prevalent [69]; or

  • add a random field “layer” on top of the GNN model to learn possible correlated structure in the labels [18, 54].

These existing approaches are unsatisfying for a number of reasons. For one, the strategies are largely ad hoc, and there is little understanding of why or when one particular approach should work. In addition, the GNN model in Eq. 2 itself is opaque and not intrinsically motivated, so more complicated methods designed on top of GNNs are unlikely to yield meaningful insights. In general, there is limited theory for these approaches.

An underlying issue is that there has been no single generative model (i.e., a stochastic data generating process) for graph data that can motivate different prediction algorithms (like a multivariate Gaussian model motivates linear regression). In this paper, we address this issue with a generative model for node features and labels. This model reveals the connections between label propagation, graph neural networks, and some of the heuristics outlined above.

1.1 The present work: A Markov random field model for attributed graphs

Figure 1: A unified view of i.i.d. data (top row), labeled graph data (middle row), and attributed graph data (bottom row) under the Gaussian MRF framework. Each vertical bar in the Gaussian MRF diagrams connects a set of pairwise interacting variables. For each type of data, we derive a learning algorithm by the conditional expectation of the unknown labels, which gives linear regression for i.i.d. data, label propagation for labeled graph data, and linear graph convolution for attributed graph data. Moreover, for attributed graphs, the expectation of unknown labels conditioned on both features and observed labels leads to hybrid algorithms that combines principles from both label propagation and graph neural networks.

In this paper, we propose a new generative model for graph data that unifies the understanding for various graph learning algorithms. For this, we assume that the topology of the graph (i.e., the nodes and edges) is given but that the attributes — features and outcomes — on the nodes are randomly generated. An overarching idea is to model the correlations of attributes on and between nodes through a common data generation process, after which we can understand the appropriate mechanisms to use either type of correlation as well as their combination to infer unknown outcomes.

More specifically, given an attributed graph, we map each attribute on each node into a random variable, and we use a Markov random field (MRF) to describe the joint distribution of the random variables. The MRF consists of two types of potentials: (i) potentials modeling the correlation between different attributes on the same node and (ii) potentials modeling the correlation between attributes among connected nodes. By treating the outcome as one attribute and the features as the remaining attributes, the first type of potentials can make the features useful for estimating the outcomes. The second type of potential has two roles. First, these potentials make it easy to express a smoothness in outcomes over the graph as would be assumed by label propagation methods following Eq. 1; indeed, a special case case of our model when there are no features (and only a single outcome attribute per node) coincides with the Markov random field used to derive a label propagation algorithm [73]. Second, the correlation in features between connected nodes make it meaningful to combine features from adjacent nodes to improve estimation, which fits the general GNN model in Eq. 2 and supports the premise upon which GNNs are motivated [23].

This probabilistic data model unifies the ideas behind label propagation and graph neural network for graph data as well as linear regression for i.i.d. data (Fig. 1). In particular, if the node attributes are real-valued, then their joint distribution is a Gaussian MRF and therefore inference algorithms can be derived by computing the conditional expectation of the unobserved labels. Since the conditional expectation minimizes the mean squared error of predicted labels, the derived algorithms are optimal for the quadratic objective functions commonly used for node regression (e.g. Eqs. 2 and 1). In this sense, our analysis serves as a type of algorithmic anti-differentiation [19] for several graph learning algorithms by showing a reasonable objective that they are not obviously optimizing.

More specifically, straightforward computations reveal the following relationships with our model:

  1. the expected value of outcomes on unlabeled nodes, conditioned on the outcomes at labeled nodes, corresponds to a variant of label propagation;

  2. the expected value of outcomes on unlabeled nodes, conditioned on the features at all nodes, corresponds to a linear approximation of the graph convolutional network (GCN) [29]; this approximation algorithm is referred as Linear Graph Convolution (LGC) in the following discussion

  3. the expected value of outcomes on unlabeled nodes, conditioned on the features at all nodes and outcomes at labeled nodes leads to the same linear GCN approximation followed by the recently proposed residual propagation heuristic [26].

Finally, if there are no edges in the graph and thus no type (ii) potentials in the Gaussian MRF, then the attributes on every vertices is an i.i.d. sample and LGC reduces to linear regression.

Our model provides insights on when certain algorithms are suitable or helpful. Intuitively, label propagation performs better than graph neural networks when outcome correlation (homophily) is much more influential than the correlation between features and outcomes. Similarly, graph neural networks are more useful in the opposite case, when correlations between features and outcomes are much stronger than homophily. Finally, when neither features nor homophily overwhelms the other, there is opportunity to combine ideas from label propagation and graph neural networks.

Our model also provides theoretical grounding for a number of recent empirical methodologies in graph-based machine learning. In particular, there are several recently proposed “simplified GNNs” that avoid much of the neural network component [30, 67, 55, 35]. Although those methods are algorithmically similar to our proposed LGC, they are largely motivated as computationally economic alternatives to GNNs that maintains good accuracy, as oppose to conditional distributions from a generative model. Our data model provides some justification for why these approaches can give high prediction accuracy, but it also provides theory for why the LGC that we derive addresses fundamental shortcomings of some existing approaches. We also find that the LGC algorithm is often more or as accurate compared to standard GNNs on a number of real-world datasets.

The LGC algorithm arising from our analysis smooths features of nodes over the graph, using the output of label propagation on the features as input to a linear model. The idea of feature smoothing arises in many contexts for understanding GNN or GNN-inspired algorithms [6, 8, 38, 41], although some studies claim that “over-smoothing” can be problematic [34, 49, 70]. Our model provides a statistical framework for understanding feature smoothing and a principled way to understand the optimal smoothing level, which we can interpret as a balance between the effects of homophily and noise.

The data model also provides a framework for evaluating inductive learning performance, where a model trained on one graph is used to make predictions on another graph where the nodes have the same types of attributes. In this setting, we can simply sample two graphs from our model, fitting parameters from one and measuring accuracy on the other. Through this, we show how homophily can make inductive problems challenging and see how GNNs generalize poorly due to “memorization” mechanisms that achieve near-zero training error.

Finally, we can fit the generative model parameters to empirical data. From this, we can sample realistic graphs with the same attributes as test cases as well as quantify uncertainty in the data in order to predict regression accuracy.

1.2 Additional related work

Collective classification and inference.  Collective classification or collective inference refers to a general set of graph-based machine learning methods that model (i) correlations between outcome at a node and features at a node, (ii) correlations between outcome at a node and features of neighbors of that node, and (iii) correlation between the outcome at a node and the outcomes of neighbors of that node [25, 58, 39]. One class of methods called “iterative classification” builds consensus into a local classifier by alternating between predicting outcomes for each node with the classifier and then updating the model with the predicted outcomes on neighboring vertices [46, 40, 42, 43, 71]. The other class of methods are based on a global formulation conceptually similar to this work, where they model categorical attributes with a discrete MRF and predict unknown labels by selecting the ones with the maximum marginal probabilities. However, since the marginally probabilities in a discrete MRF have no analytically expression, those methods require fitting the MRF potentials and then estimating the marginal probabilities with loopy belief propagation or mean field approximations [58]; both steps are computationally expensive and require careful initialization. In contrast, our model motivates efficient and stable learning algorithms that can be directly trained on the observed labels.

Regression with network autocorrelation.  The idea that outcomes should be correlated on a network is a longstanding idea in social network analysis [12, 32, 7, 37], where tools from spatial statistics (namely, spatial autocorrelation) are adapted to the network setting. Recent analysis of penalty-based statistical models that encourage smoothness in parameter estimates over the graph provides statistical rigor to similar models [36]. At a high level, these models aim to estimate the effects of covariates, which is different from our transductive setting where outcomes are missing on some nodes in the graph.

Attributed graph models.  Finally, there are several generative models for attributed graphs [28, 33, 53, 16]. These models largely revolve around modeling the probability of edges given node attributes. In contrast, we assume that the edges are given and develop a probabilistic model for the attributes.

2 A model for attributed graph data and deriving graph learning algorithms

Let be an undirected graph, where is the set of vertices (nodes) labeled , and is the set of edges. We denote the (possible weighted) symmetric adjacency matrix of the graph by , and the diagonal degree matrix by , i.e., , where is the vector of all ones. We use for the normalized graph Laplacian, where is the symmetrically normalized adjacency matrix. We assume that each node is associated with a vector of features and a scalar that we call the label or outcome. We use for the feature matrix and for the outcome vector. For convenience, we stack the feature matrix and outcome vector into an attribute matrix . The th row of the attribute matrix (denoted ) represents all attributes on vertex , and the th column of the attribute matrix, denoted , represents the th attribute on all vertices. For simplicity, we further assume each attribute is preprocessed to center around zero, i.e., for .

In this section, we consider algorithms for semi-supervised learning on graphs, where outcomes are available on a set of labeled nodes , and we want to predict the outcome at the remaining nodes. More formally, the input to an algorithm for this problem is (i) the graph adjacency matrix , (ii) the features on all nodes, and (iii) the labels on a subset of vertices . The output is a vector of labels on the unlabeled nodes . As discussed above, there are myriad algorithms for this problem. Our goal is to develop a generative model for the attributes under which several popular algorithms are optimal in some sense.

We propose a Gaussian Markov random field as a generative model for attributed graph data. We then show how various conditional expectations under this model lead to (variants of) graph-based learning algorithms such as label propagation, graph convolutional networks, and residual propagation. Since the conditional expectation minimizes the mean-squared prediction error under the data generation model, these algorithms are “optimal” as measured by quadratic loss functions. Our generative model maps each feature or label into a random variable. We use italic symbols for observed values of those quantities and non-italic symbols for the corresponding random variables in our model. We assume that the graph topology, i.e., the matrix , is given and not random.

Finally, our derivations repeatedly uses marginalization and conditioning of multivariate Gaussian distributions. Section A.1 has the necessary background on these computations.

2.1 The model for attributed graph data

We assume all vertex attribute values are jointly sampled from a distribution over random variables , where each attribute on each node is a random variable. Our model for the joint distribution of the attributes is a Gaussian Markov random field (MRF) with probability density function:

(3)

Here, the model parameter is a symmetric positive definite matrix and the model parameter is entrywise positive, and the log-potential is defined as

(4)
(5)
(6)

For convenience, we write the potential as the following quadratic form:

(7)

At a high level, the first term in Eq. 4 encodes correlations among different attributes on each vertex. The second term increases the probability of having attributes that are smooth over the graph, since and , where denotes the degree of node . This type of smoothness assumption is natural for homophily. These two terms will be analyzed in detail in the following discussions. One could choose other notions of smoothness, such as the quadratic form on the combinatorial Laplacian instead of the normalized Laplacian; however, the normalized version will eventually lead to algorithms that are closer to those existing in the literature.

Since and and are symmetric positive definite, is also symmetric positive definite, and Eq. 3 defines the multivariate Gaussian distribution

(8)

where is the precision matrix. In other words, the attributes are jointly sampled via

(9)

Next, we see how conditioning on observing different attributes leads to different graph learning algorithms, as illustrated in Fig. 1.

2.2 Linear regression when there are no edges

If there are no edges in the graph, then the attribute vectors over the nodes are i.i.d. samples from a multivariate Gaussian, resulting in a linear regression model. This is a well-known setup for linear regression [45], and we verify it here for our model to aid in understanding more complicated cases later.

When the edge set is empty, the log-potential function decomposes into

(10)

and the probability density function in Equation 3 can be factorized accordingly:

(11)

In other words, are i.i.d. samples from a multivariate Gaussian with mean and precision matrix . Therefore, for any vertex , the conditional expectation of its label is given by:

(12)

where the second equality directly follows from Eq. 65. Or, in matrix notation,

(13)

which is just linear regression with coefficients .

Rather than fitting all of the model parameters , this derivation suggests a simple algorithm for label prediction — find the optimal on the training data with, e.g., ordinary least squares, and make predictions for all . There are many similar ways to arrive at linear regression and ordinary least squares, but this pattern will be helpful for deriving the algorithms in the following sections.

2.3 Label propagation when conditioning on observed labels

Next, we consider the setting in which there are no features in the graph () to demonstrate how our data model encodes label homophily and leads to a label propagation algorithm. Gaussian random fields were used to develop early label propagation algorithms for graph learning [73], where the potential function is . Our potential function is similar, so it is unsurprising that we arrive at a similar algorithm. However, our model has a parameter that helps balance label homophily and noise, and this will be crucial for our graph neural network approximations later.

In this setting, the positive vector and positive definite matrix reduce to positive scalars and . As a result, our model assumes that the labels are jointly sampled from a multivariate Gaussian:

(14)

The parameter controls the homophily level: when is larger, sampled outcomes are smoother along the graph topology. The parameter controls the noise level: is the variance for the outcome on each individual node if the graph contains no edges. The conditional distribution of given is:

(15)

where the conditional mean is

(16)
(17)

with as a parameter that controls the smoothing level, as we will show next. We prove in Section A.2 that the conditional mean is the fixed point (i.e., ) of the following label propagation algorithm

(18)
(19)

where grows monotonically with , and is the 1-hop neighborhood of (i.e., the set of nodes connected to by an edge). Since and are bijective and strictly monotonic, we use them interchangeably throughout the paper. The update equation is equivalent to the label propagation formulation by Zhou et al. [72], with the extra constraint that the predicted outcomes on observed vertices are fixed to their true values.

Alternatively, we can also connect this to an optimization problem similar to Eq. 1, which aligns with many classical derivations of label propagation algorithms for semi-supervised learning [73, 72]. In particular, consider

(20)

where is given, and . This optimization problem is equivalent to the semi-supervised learning formulation of Zhou et al. [72], just with an additional constraint requiring the final solution exactly matches the observed labels, which was separately considered in a similar formulation by Zhu, Ghahramani, and Lafferty [73]. Setting and , one can show that the conditional mean is a stationary point solution of Eq. 20 and that the label propagation algorithm corresponds to a projected gradient descent algorithm.

In the label propagation algorithm of Zhou et al. [72], the factor just comes from the regularization hyperparameter in the optimization framework given by the unconstrained version of Eq. 20. Under our model, the optimal value of (and hence ) is determined by the homophily and noise levels in the data generation process — higher homophily (larger ) and higher noise (small ) require more smoothing (larger ), or equivalently, larger in Eq. 19, placing more weight on neighboring nodes. Still, when implementing label propagation in practice, choosing via cross-validation can be more practical than estimating and , which is an approach that we take in many of our numerical experiments.

For this derivation, we assumed that there were no features on the nodes. An alternative derivation might assume that the nodes have features (generated by our model) and then marginalize over them before conditioning on observed labels. While conceptually similar, this actually leads to a different label propagation algorithm, which we derive in Section A.3. This alternative algorithm does not have sparse computations, so we favor the above approach.

2.4 Linear graph convolutions when conditioning on features

Next, we consider a graph sampled from our attributed graph model and conditioning on the features of all nodes (but not the known labels). While conditioning on both features and known labels is natural (and we develop that in the next section), the setup here closely mirrors standard graph neural networks (GNNs) of the form in Eq. 2. Thus, this section exposes a natural flaw in standard GNN approaches in any graph dataset with homophilous structure, namely that they ignore correlation in the labels.

Let and denote the precision matrix indices for features and labels, respectively. Given the features and using Eq. 65, the conditional distribution of the labels is

(21)

The precision matrix is , where and controls the homophily level and the noise level of the outcomes. The mean of the distribution is

(22)
(23)
(24)
(25)

where and . Just as we did with linear regression, given data, we can translate the conditional expectation into an algorithm by fitting with ordinary least squares on the known labels . We call this a linear graph convolution (LGC) based on our analysis below.

Comparison with linear regression.  The conditional expectation boils down to a linear function of in Eq. 25. Thus, we still reduce to linear regression, and the only difference with Eq. 13 is that the features are transformed in a pre-processing step: . When there are no edges in the graph, and we reduce to Eq. 13.

Relationship to label propagation and feature smoothing.  The transformed feature matrix resembles the label propagation fixed point of Eq. 17. This is unsurprising given that the features are also drawn from a multivariate Gaussian; this time, we just don’t have any known labels that should remain fixed. More formally, let be a row of and the corresponding row of ; these are the original and transformed features for node . Then, following the same steps as we did for label propagation, we can write as the solution to the optimization problem

(26)

or as the fixed-point solution of a feature propagation algorithm

(27)
(28)

where for or for , and we have shifted the parameter seen in the first term of the objective in Eq. 20 to the second term in the objective of Eq. 26.

Equations 28 and 26 both highlight how each column of is smooth over the graph, and the parameter controls the amount of smoothing, as dictated by the generative data model. From the perspective of graph signal processing, similar type2 of smoothing has been discussed as a low-pass filter for features [35], and feature smoothing is a heuristic for several graph-based learning methods [6, 8, 38, 41]. Our model puts these ideas on firm ground.

Some studies argue that feature smoothing is problematic for graph-based learning [34, 70] (the so-called over-smoothing problem), although such analyses are largely empirical. Our data model elucidates the problem clearly from two fronts — attribute homophily and noise — as captured by the parameter . Thus, in our model, “oversmoothing” is really just a result of a misspecified . We will see this in our numerical experiments in Section 3.2.

Connections to (simplified) graph convolutional networks.  To make a connection from LGC to graph convolutional networks, we start with the fact that Eq. 25 is nothing but linear regression with feature smoothing as a pre-processing step. Therefore, it is natural to consider other types of feature smoothing. For one example, the simple graph convolution (SGC) network [67] uses the smoothing matrix instead of , where is the normalized adjacency matrix of the graph with self-loops:

(29)

The map explicitly performs steps of smoothing (weighted averaging) over neighbors:

(30)
(31)

There are a couple of mathematical oddities of SGC.3 First, the th power puts all of the emphasis on paths of length , and the self-loops are incorporated as a workaround to incorporate information on shorter paths. This contrasts with typical diffusions (e.g., PageRank or Katz) that explicitly put more weight on shorter paths (often with geometric decay). Our feature smoothing has the same geometric decay property, too, which one can see from the Neumann expansion

(32)

Second, since , converges to a projector onto the dominant eigenspace of as . In the typical case where this eigenspace is unidimensional, each column of converges to a scalar multiple of the dominant eigenvector of , which is independent of the features!

Oddities aside, comparing Eq. 28 with Eq. 31 reveals a more fundamental and important difference. To control smoothing, our LGC uses a real-valued parameter , while SGC uses an integer-valued hyperparameter , i.e., the number of convolutional layers. This is a drawback of SGC that can prevent the method from achieving the optimal feature smoothing level, which we will see in Section 3.1.

Finally, the Graph Convolutional Network (GCN) [29], which is the standard-bearer graph neural network, can be derived by just adding a nonlinear transformation of the features between each application of the matrix :

(33)

where is typically the rectified linear unit (ReLU), i.e., entrywise. The model parameters are fit by minimizing . Note that with a linear activation function , Eq. 33 collapses to Eq. 29 with . The linear feature transformation works empirically on some tasks, and this approach was simply motivated as a way to show that the nonlinear activation in GCN may be unnecessary [67]. While we will compare the performances of SGC and GCN on a wide range of datasets in Section 3, here we have given them new meanings as different approximations for the conditional expectation in our data model.

Recall that in all of these approaches, we are only conditioning on observing the features and not on the known labels — the known labels are just used to fit the parameters of the model. Of course, if we assume that the label distribution is correlated on the graph, we should use that information, which leads to the approaches in the next section.

2.5 Residual propagation when conditioning on both features and observed labels

The previous section showed that when just conditioning on observing the features on the nodes in our data model, we should first smooth them with the transformation . Graph neural networks operate similarly, constructing some representation matrix to use as features for a linear model (e.g., or for a 2-layer GCN or SGC). The representation depends on labels through an optimization process as in Eq. 2. Still, in either case, once the representations are given, predictions are independent at each node, as the objective in Eq. 2 is separable. This is strange as we often assume that outcomes are strongly correlated over the edges of a graph; GNNs capture this implicitly at best, if at all. Independent predictions (conditioned on representations) are a major shortcoming of GNNs, which has led to several heuristics for incorporating labels [60, 30, 66, 26, 69, 18].

Addressing this issue is simple within our model — we just compute the expected label at unlabeled nodes conditioning on both the features on all nodes and the known labels. Recall that the joint distribution of the labels on all nodes given all the features is

(34)

Using Eq. 65, the conditional distribution of given the observed labels is

(35)

The mean of this distribution is also the conditional expectation of :

(36)
(37)
(38)
(39)

where , is the regression residual on the observed labels, and we define . This is almost the exact expression as the conditional expectation in Eq. 17, which was the fixed point of a label propagation algorithm. The only difference is that the residuals are used for initialization, i.e., is the fixed point of the following propagation algorithm

(40)
(41)

where for . Thus, can be interpreted as the estimated regression residuals on the unlabeled vertices, which is used as a correction term for the LGC predictions.

In summary, Eq. 39 leads to a 3-step algorithm:

  1. Compute the LGC prediction with Eq. 25, fitting with known labels.

  2. Compute the regression residual on labeled vertices , and use label propagation initialized with these residuals to estimate the residuals on unlabeled vertices.

  3. Add the estimated residual to the base prediction to get a final prediction on the unlabeled vertices (Eq. 39).

We refer to steps 2 and 3 of this algorithm as residual propagation (RP), and the entire algorithm as LGC/RP, since is the prediction given by LGC. Here, the RP name comes from our prior work, in which nearly the same correction step was developed as a heuristic layer on top of several graph learning algorithms, where is replaced with a more general predictor, such as the output of a graph neural network or a multi-layer perceptron [24, 26]. Putting all of the algebra together, LGC/RP is a linear model in and uses only linear transformations of :

(42)

although the three-step procedure above is easier to interpret.

We will see in our experiments that adding RP to other algorithms for estimating often yields good empirical performance, matching prior research [24, 26]. However, we emphasize that the LGC/RP algorithm was derived from simply computing a conditional expectation on all of the available information (features and labels) in our attributed graph model.

3 Transductive learning experiments

We have shown that label propagation (LP), LGC, and LGC with residual propagation (LGC/RP) are all a result of computing the conditional expectation of the unknown labels in our proposed Gaussian MRF distribution, and that linear regression (LR) arises from a conditional expectation ignoring the graph topology. Next, we test the performances of those algorithms, along with other graph learning algorithms, on several synthetic and real-world datasets, largely to understand relationships among smoothing, nonlinearities, and out-of-sample prediction performance. In this section, we focus on the so-called transductive learning setting, where we train the models on some vertices in a graph and test on the rest of vertices from the same graph. In Section 4, we consider an inductive learning setting where the models are trained and tested on vertices from different graphs.

3.1 Experiments on synthetic datasets

We start our experiments on synthetic graph attributes sampled from the Gaussian MRF model.

Synthetic data sampling.  First, we select the synthetic graph topology by sampling from a Watts-Strogatz model with vertices, average degree of , and rewiring probability of . We set and randomly initialize the Gaussian MRF parameters, creating as follows: (1) uniformly sample five 5-dimensional Gaussian random vectors ; (2) create a matrix with ; and (3) set . If the vertex attributes are i.i.d.(), then  — given by the inner product of and  — is also the covariance between attribute and . For the parameter that controls the homophily strength of each attribute, we let , where are sampled from a uniform distribution supported on . We sample three attributed graphs, using , where larger values corresponds to higher level of homophily.

Experimental setup.  We compare LP, LGC, and LGC/RP with baseline algorithms linear regression (LR), SGC, and GCN. We also apply residual propagation post-processing to SGC and GCN, and we call these algorithms SGC/RP and GCN/RP. For each dataset, the vertices are randomly split into training and testing, and we use the coefficient of determination to measure regression accuracy. Our label propagation implementation uses the constrained formulation Eq. 19. The pre-processing smoothing step in LGC is performed by running the propagation algorithm of Eq. 28 on the features. Moreover, for the graph convolutional models SGC and GCN, we use the SAGE-GCN variant [22] that subsamples each vertex’s neighborhood to lower the computational cost, with maximum subsampling size equal to . The neural network parameters are trained using the ADAMW optimizer with learning rate and decay rate of . The hyperparameters in label propagation, LGC, residual propagation, number of layers in SGC and GCN, and number of training epochs are tuned with 5-fold cross validation on the training set.

For each attributed graph, there are five attributes, and we use each type as either a feature or label in separate experiments. In each experiment, we choose a single attribute as the outcome and use the remaining four as features for prediction. Each combination of dataset, choice of label, and prediction algorithm is repeated 10 times with different random splits. Since the synthetic data generation process does not distinguish different attributes, we further average the accuracies across different choices of labels for the same dataset and prediction algorithm. The performances of different methods as well as the optimal hyperparameters are summarized in Table 1.

LP () LR LGC () SGC () GCN () LGC/RP () SGC/RP () GCN/RP ()
0.19 (0.79) 0.68 0.70 (0.28) 0.37 (1.8) 0.34 (1.7) 0.73 (0.29) 0.40 (1.8, 0.21) 0.37 (1.7, 0.21)
0.43 (0.95) 0.48 0.58 (0.57) 0.45 (2.1) 0.45 (2.0) 0.68 (0.56) 0.56 (2.1, 0.46) 0.54 (2.0, 0.43)
0.59 (0.99) 0.24 0.42 (0.85) 0.38 (2.3) 0.45 (2.5) 0.64 (0.85) 0.63 (2.3, 0.81) 0.62 (2.5, 0.79)
Table 1: Transductive learning accuracy on synthetic datasets as measured by the coefficient of determination (). Each reported accuracy is averaged over 5 different choices of label, each repeated 10 times with different random data splits. We also report the values and number of layers that give the highest validation accuracy, averaged over the experiments. Higher levels of homophily (larger ) lead to more smoothing (larger or ).

Main accuracy results.  The algorithms in Table 1 can be divided into three groups based on the information they use: \⃝raisebox{-0.9pt}{1} label propagation (LP) that only uses outcome homophily; \⃝raisebox{-0.9pt}{2} feature-based methods LR, LGC, SGC and GCN that use the correlation between (transformed) features and outcome; and \⃝raisebox{-0.9pt}{3} hybrid methods LGC/RP, SGC/RP and GCN/RP that account for both.

Comparing the performances across these different groups of algorithms, different algorithms are favored for different datasets. For data sampled with a small (), the feature correlations are more important; in this case, the feature-based methods are far superior to LP. On the other hand, when , the outcome homophily is more influential, and LP out-performs the feature-based models. For the intermediate value , LGC and LP have comparable accuracy, and the hybrid method LGC/RP is able to improve over both algorithms by a large margin. Finally, the hybrid methods are strictly better than both LP and their feature-only counterparts, even though SGC/RP and GCN/RP are just heuristic algorithms.

Comparing performance within the feature-based methods (group \⃝raisebox{-0.9pt}{2}), the feature smoothing step in LGC always provides improvement over just LR. This is not the case for SGC, which has serious performance degradation on attributed graphs sampled with low homophily levels (), and linear regression is a better choice. One may attribute the performance difference between LGC and SGC to the fact that LGC is derived from the model class from which the data is sampled. However, we will see that LGC also performs better on real-world datasets, and we provide some intuition for this shortly.

When comparing the performance within the hybrid methods (group \⃝raisebox{-0.9pt}{3}), LGC/RP always gives the highest regression accuracy. This is entirely expected given we have shown that the LGC/RP computes the expected unknown outcomes conditioned on all available information, and thus should be the optimal algorithm measured by the sum-of-the-square error (or coefficient).

Interpretation of optimal smoothing parameters.  In addition to the regression accuracies, Table 1 also reports the optimal hyperparameters of different learning algorithms on the three attributed graphs. As the homophily level between neighboring vertices increases with , we observe that the optimal smoothing parameters in LP and RP increases, which is consistent with our theory in Sections 2.5 and 2.3. For feature smoothing in LGC, the optimal smoothing parameter also increases with , as the algorithms place more weight on the neighboring nodes to reduce noise. Similarly, SGC and GCN perform better with more convolutional layers, meaning that they use larger neighborhoods to increase smoothness. While the hyperparameter in LGC can take any real value between 0 and 1, the number of layers in SGC and GCN can only take integer values and is therefore less flexible. This is part of the reason SGC underperforms LGC by a large margin, which we discuss next.

Comparing feature smoothing in LGC and SGC.  We have shown in Section 2.4 that one way to understand the difference between LGC and SGC is that they perform feature smoothing differently. At first glance, featuring smoothing with SGC seems more straightforward: while applying in LGC requires running the propagation in Eq. 28 until convergence, SGC simply performs rounds of weighted averaging with neighbors using . Here we use the graph signal processing framework [50] to analyze both smoothing techniques as low-pass filters, where has much more expressivity, as a way to understand their empirical performance differences.

Figure 2: The frequency response functions for the smoothing filters of LGC and SGC on an arbitrary -regular graph. The color gradient from blue to red corresponds to increasing values from to and from to . There are two limitations of SGC’s smoothing filter : (i) it is not a low-pass filter for eigenvalues and (ii) even on its low-pass spectral range , it does not have enough expressive power.

Let denote the symmetric eigendecomposition of the normalized Laplacian matrix. We interpret each eigenvector as a signal over the graph, where is the signal strength on node . Eigenvectors with small eigenvalues are associated with smooth signals (over the graph) and eigenvectors with large eigenvalues are associated with more oscillatory signals.

Let be a column of corresponding to a vector of features and write , where denotes the coefficient for the th eigenvector. Then, by the spectral mapping theorem,

(43)

where is the new coefficient of the processed signal for the th eigenvector. We can see suppresses the coefficients for eigenvectors with larger eigenvalues, while leaving the coefficient for eigenvector with intact, so it is a low-pass filter on graph signals. Since eigenvectors corresponding to larger eigenvalues are more oscillatory over the graph, supressing them smooths the features.

We write , where is called the frequency response function of the LGC filter, which we will compare with the frequency response function for SGC. For simplicity, assume that the graph is a -regular, which lets us diagonalize with the basis of eigenvectors for :

(44)

The corresponding frequency response function is and corresponds to a low-pass filter for eigenvalues in . Prior research has remarked that suffers from the fact that is it not a low-pass on the entire range of eigenvalues [67, 35], but we point out a more pressing issue here, namely that the filter is not expressive enough even on the range where it is a low-pass filter.

The expressiveness of a low-pass filter can be measured by how well the frequency response function can approximate an arbitrary decaying function. Our design choices are a continuous parameter for LGC and an integer parameter for SGC, which already hints at why LGC might be able to construct better filters. We choose and compare the frequency response functions for the filters of LGC and SGC in Fig. 2. Tuning spans the entire space of convex smoothly decaying functions over the eigenvalues. In contrast, the frequency response function for and are vastly different, and there is no intermediate filter between them, as must takes integer values. When smoothing features, one has to trade off the desire to reduce noise (increase smoothness) and preserving the signal (decrease smoothness). The LGC filter can always choose to find a sweet spot in this trade-off, whereas the SGC filter offers limited flexibility; in some cases, the noise hurts prediction accuracy without feature smoothing (), whereas using just may already force too smooth a signal. (Section 3.2 contains a discussion on how to choose an optimal smoothing level .)

Effects of non-linearities.  We can analyze the effects of non-linear activation functions by comparing GCN with SGC. The GCN performs marginally worse than the latter when , yet considerately better when . This might seem strange at first glance, since we know that linear models are optimal for our synthetic data. However, nonlinear functions are still helpful in two ways. First, we have already seen the limited flexibility of SGC as a low-pass filter, and the nonlinearities provide additional modeling power that could be helpful for approximating the right filters. Second, the increased modeling power gives the GCN higher capacity to partially “memorize” the training examples, as evidenced by near-zero loss on the training set. When the outcomes are smooth along the graph structure (larger ), memorizing the features in the neighborhood of each training node and the corresponding label should help with out-of-sample prediction, as there are overlaps between the neighborhoods of the training and testing vertices. This is conceptually similar with the common practice of setting when no features are available [29], where makes the initial-layer embedding of two nodes similar if they have similar sets of neighbors. This type of memorization has downsides. Although the GCN performs better than the SGC here in the transductive setting when , we will show in Section 4 that it performs worse in the inductive setting on an unseen graphs sampled under the exact same parameters, simply because the training and testing vertices no longer share any neighbors.

3.2 Identifiability, over-smoothing, and under-smoothing

\phantomcaption
\phantomcaption
\phantomcaption
Figure 3: (a) Performance of label propagation on synthetic labeled graphs (). The vertex labels in each synthetic graph are drawn from a Gaussian MRF with precision matrix for various . For each sampled graph, we run label propagation with different parameters and measure the regression accuracy with . For each Gaussian MRF , the yellow circle marks the value for LP that gives the largest . The inset illustrates the performance along a fixed used to generate the data shows, corresponding to the vertical black dashed line. Most circles are on the diagonal yellow line, which confirms that the optimal LP parameter is essentially the Gaussian MRF used for sampling vertex labels. (b) Performance of linear GCN on synthetic attributed graphs (). The vertex attributes in each synthetic graph are drawn from a Gaussian MRF with certain parameters ; the optimal smoothing parameter is . For each sampled graph, we run LGC algorithm with different parameters. The LGC parameter that gives the highest is always close to the used to generate the data. (c) The performance of LGC with residual propagation post-processing on the same synthetic attributed graphs as in (b). Again, the LGC/RP parameter that gives the highest matches the optimal used to generate the data.

We have made the expressibility argument for LGC (over SGC) through theory and experiments. Next, we focus on an identifiability problem — can we recover the true ? Indeed, we see that simple hyperparameter tuning strategies are sufficient for synthetic data, motivating the same strategy for real-world data. To demonstrate this, we again use synthetic dataset generated from our attributed graph model, where the optimal smoothing level is determined by the parameter values of and used during sampling (see Section 2).

We start with the label propagation algorithm. Typically, the parameter is chosen to be relatively large, such as  [72], which corresponds to . This may be sufficient for the classification setting, where a “score” for each class is generated for each node, and all that matters for accuracy is whether or not the correct class has the highest score. In the regression setting, where the performance is measured with squared error, choosing a good hyperparameter is crucial. For synthetic data generation, we use the same graph topology as in Section 3.1 and consider different parameters of our model. We set , so that there are no features, and the Gaussian MRF precision matrix reduces to . Since an overall scaling in all the outcomes does not influence the regression performance, we further simplify with . We generate synthetic labeled graphs under a range of values from to . For each labeled graph, we randomly select of the vertices for training and test the regression accuracy on the remaining vertices, using estimates generated via the procedure in Eq. 19 with a range of values. These values are our guesses for the “true ” and the estimate would be the one that gives the best out-of-sample accuracy.

Each combination of Gaussian MRF (used for data generation) and label propagation (used for prediction) is repeated times with different random seeds, and the mean accuracy is plotted in Fig. 3. The empirical optimal label propagation under each Gaussian MRF is marked with a yellow dot. We would hope that label propagation works the best when its estimate matches the Gaussian MRF parameter used for sampling. This is confirmed by the fact that most yellow dots in Fig. 3 are located along the diagonal. In other words, tuning the parameters in the label propagation algorithm can indeed find the optimal smoothing level.

We ran analogous experiments for LGC and LGC/RP. In this case, the Gaussian MRF has two attributes (one feature and one label) with parameters and . We fix the matrix elements and vary the homophily strength from to . The theoretically optimal smoothing parameter for LGC and LGC/RP (given by ) also varies from to . We again sample attributed graphs from the Gaussian MRF model, use 30% of vertices for training and 70% for testing, average over 30 random samples, and compare whether the estimated (by selecting the that gives the best accuracy) matches the used to generate the data. Similar to the experiments on LP, we find the estimated for LGC and LGC/RP always approximately match the used to generate the data (Figs. 3 and 3).

When the homophily level in the dataset is not overwhelming large (e.g., the data generation process uses ), running LGC (or LGC/RP) with a larger estimated causes significant performance degradation, as shown by the top-left corner of each subplot. This corresponds to the aforementioned “over-smoothing” phenomenon in the practice of graph neural networks [34, 49, 70]. Our data model makes it clear when and why this could be a real problem. The LGC model can avoid over-smoothing by finding the appropriate parameter on the validation set, but the same may not not true for other graph neural networks (we saw this with SGC in Fig. 2). Similarly, the bottom-right corner of the plot corresponds to an “under-smoothing” region, where the algorithms would not take full advantage of the homophily levels to denoise the sample; our attributed graph model shows that this can happen, but this issue has not been studied in the graph neural networks literature.

When comparing Fig. 3 and Fig. 3, one notable difference is that the absolute performance of LGC (with optimal ) decreases as the Gaussian MRF increases. The reason for this is that the sampled attributed graph is increasingly dominated by homophily between neighbors, and LGC ignores the outcome correlation. In contrast, LGC/RP still performs well in the high region, as it captures outcome correlation with residual propagation.

3.3 Experiments on real-world datasets

Dataset # vertices # edges # attributes
U.S. 3,106 22,574 10
CDC 3,107 9,230 5
London 608 1,750 4
Twitch 1,912 31,299 66
Table 2: Summary of real-world datasets.

Now that we have understood the regression performances of different algorithms on the synthetic datasets, we turn our attention to real-world attributed graphs. We consider the following attributed graphs with real-valued vertex attributes. The statistics of these datasets are summarized in Table 2.

U.S. election map. Each vertex in this dataset represents a county in the United States [26]. The vertex attributes are demographic information (e.g., median household income, education level, and unemployment rate) and election margins of victory in each county during the 2016 presidential election. In addition, for each county, we add three extra features from the Facebook social connectedness index dataset [3, 4], measuring the average share of friends living within 50 miles, 100 miles and 500 miles for Facebook users in that county. The graph topology is also taken from the Facebook dataset, where each county is connected to approximately other counties with the strongest social ties. This social graph is non-planer due to long range friendships (e.g., Wayne County, Ohio is connected to Loving County, Texas).

CDC climate data. Here, vertices correspond to counties in the United States and edges connect counties that share a physical border. The attributes in each county are real-valued environmental indicators including air temperature, land temperature, precipitation, sunlight, and pm2.5 averaged over the year of 2008 [48].

London election map. In this dataset, the vertices represent wards in London, and the edges encode adjacent wards [21]. The vertex attributes are demographic information (income, education level, and average age) and election margin of victory in each ward during the 2016 mayor election.

Twitch social network. This is a friendship network amongst Twitch streamers in Portugal [56]. The raw vertex features are binary variables such as the games played and liked, and streaming habits. We project these raw features onto the first 64 principal components to have 64 real-valued features at each node. This graph exhibits very high node degree heterogeneity, so we also include the square root of vertex degree as an additional feature. The outcome is the number of days each streamer has been streaming on Twitch.

Dataset Outcome LP LR LGC () SGC () GCN () LGC/RP SGC/RP GCN/RP
U.S. income 0.40 0.63 0.66 (0.46) 0.51 (1.0) 0.53 (1.3) 0.69 0.55 0.55
education 0.31 0.71 0.71 (0.00) 0.43 (1.0) 0.47 (1.0) 0.71 0.46 0.48
unemployment 0.47 0.34 0.39 (0.59) 0.32 (1.3) 0.45 (2.5) 0.54 0.52 0.53
election 0.52 0.42 0.49 (0.68) 0.43 (1.1) 0.52 (2.1) 0.64 0.61 0.61
CDC airT 0.95 0.85 0.86 (0.78) 0.86 (2.6) 0.95 (3.0) 0.96 0.97 0.97
landT 0.89 0.81 0.81 (0.09) 0.79 (1.0) 0.91 (2.4) 0.90 0.93 0.93
precipitation 0.89 0.59 0.61 (0.93) 0.61 (2.3) 0.79 (3.0) 0.89 0.90 0.90
sunlight 0.96 0.75 0.81 (0.97) 0.80 (3.0) 0.90 (3.0) 0.96 0.97 0.97
pm2.5 0.96 0.21 0.27 (0.99) 0.23 (2.7) 0.78 (3.0) 0.96 0.96 0.97
London income 0.46 0.85 0.85 (0.00) 0.64 (1.0) 0.63 (1.0) 0.85 0.65 0.64
education 0.65 0.81 0.83 (0.40) 0.74 (1.6) 0.79 (1.4) 0.86 0.77 0.79
age 0.65 0.73 0.73 (0.17) 0.66 (1.2) 0.70 (1.7) 0.75 0.72 0.72
election 0.67 0.73 0.81 (0.74) 0.74 (2.0) 0.76 (2.1) 0.85 0.78 0.78
Twitch days 0.08 0.58 0.59 (0.67) 0.22 (1.4) 0.26 (1.7) 0.60 0.23 0.26
Table 3: Transductive learning accuracies on real-world datasets as measured by the coefficient of determination (). Each reported accuracy is averaged over 10 runs with different random data splits. Optimal hyperparameters for LGC () and SGC/GCN () are reported, and are positively correlated. Even though this data is not sampled from our Gaussian MRF model, the LGC/RP algorithm derived from that model is accurate.

Main accuracy results.  We use the same experiment setup as we used for the synthetic datasets in Section 3.1, except that we no longer perform averaging over different choice of labels when reporting the performance on the same attributed graph (Table 3). Although the vertex attributes are no longer sampled from a Gaussian MRF, LGC still consistently outperforms SGC, further confirming our argument that is a more expressive smoothing filter. On the other hand, while GCN performs only slightly better than SGC on synthetic data, the extra non-linearity in GCN does provide more performance gains over SGC for the real-world datasets we considered here (e.g., predicting pm2.5 in the CDC climate dataset). LGC still outperforms GCN in out of prediction tasks shown above, which highlights the importance of finding an optimal level of feature smoothing. Finally, the most accurate methods are the hybrid methods, as they accounts for both feature correlation and outcome homophily.

Figure 4: Transductive accuracies as a function of the fraction of nodes that are labeled. LP accuracy often increases considerably with the size of the training set, whereas LGC and SGC benefit relatively less from more training examples.

Impact of the training set size.  We pick one choice of label in each dataset and track how the regression accuracies of different algorithms changes as the training split ratio increases from 10% to 60%. We use the same experiment setup and training method, and we tune the hyperparameters on the training nodes with 5-fold cross validation. We evaluate LP, LGC and LGC/RP, which represent three groups of algorithms based on outcome homophily, feature correlation, or both, and also include SGC and GCN for comparison. The results are summarized in Fig. 4. First, we see that different groups of algorithms are suitable for different datasets. While LP consistently outperforms LGC on the CDC dataset, the opposite is true on the Twitch dataset. Secondly, the impact of increased training data varies across different groups of algorithms. While the accuracy of LP typically increases substantially with the number of observed labels, the accuracies of LGC and SGC are relatively constant. Since LGC and SGC are inductive learning methods with limited capacity, they benefit little from additional examples once the training set reaches a certain size. On the other hand, the performance of GCN moderately increase with the number of training nodes. This is more evidence that the GCN’s high model capacity allows it to partially memorize the training examples. However, the memorization mechanism generalizes poorly to nodes in a different graph, as we demonstrate in the next section.

4 Inductive learning experiments

Next, we consider the inductive learning setting, where a model is trained on one attributed graphs and tested on another. This setting is useful when both graphs have similar properties, but vertex labels in the testing one are difficult or expensive to obtain. We start with synthetic data where the training and testing graphs are two samples from our model. Then we look at a real-world example based on U.S. election outcomes in different years.

4.1 Experiments on synthetic datasets

Figure 5: The inductive learning accuracies on synthetic attributed graphs sampled from Gaussian MRF with three different values. The transductive learning accuracies are also included for comparison. Each reported accuracy is averaged over 5 different choices of label, each repeated 10 times with different random data splits. On the attributed graph sampled with or , the inductive performances of the four algorithms are close to their transductive counterparts. When , correlation in the outcome dominates the correlation between features and outcome, and the performance gap widens. In this setting, the GCN outperforms LGC and SGC in the transductive setting, but performs poorly in the inductive setting, which we argue is based on a neighborhood memorization mechanism that does not generalize to nodes in an unseen graph.

Experimental setup.  The synthetic data generation uses the same Gaussian MRF as in Section 3.1, except that in each experiment two attributed graphs are generated, which we denote by and . In the transductive experiment, we used 30% labels from for training, and the rest 70% labels for testing. Here, we instead use 30% labels from for training, and test on the same 70% labels in as the transductive case. Since we assume no labels in are available, we only consider the LR, LGC, SGC and GCN algorithms and not RP variants or LP. For each algorithm, we use the same optimizer and hyperparameter tuning methods as described Section 3.1. We average the regression accuracies over repeated experiments as well as different choice of labels. The results are summarized in Fig. 5, which also shows the transductive learning accuracies for comparison.

Results.  First of all, we expect that the inductive learning accuracies will be worse (or at best the same) compared to their transductive counterparts, which is exactly what we observe in Fig. 5. Moreover, the gaps between inductive and transductive accuracies vary greatly as the homophily level dictated by changes. While the inductive and transductive accuracies are almost identical on graphs sampled under , all four algorithms perform much worse in the inductive setting when . This is because when is small, the correlation between features and outcome is the most influential, and that correlation is captured by the learning algorithms. On the other hand, the outcome homophily becomes the dominating correlation when , but none of the learning algorithms can take advantage of outcome correlation in , as the labels are unavailable by assumption. Secondly, while the transductive accuracies of GCN is comparable to LGC and SGC, it does not generalize as good as those linear methods. In particular, when , GCN outperforms SGC in the transductive setting but underperforms SGC by a large margin in the inductive setting. As discussed in Section 3.3, the GCN achieves near zero training error, and the GCN is partially memorizing the neighborhood of the training nodes. This generalizes well to testing nodes in the same graph as they share neighbors but not to testing nodes in a different graph, as evidenced by the degradation in performance. Finally, LGC gives the highest inductive learning accuracies on all three datasets, which is expected because the algorithm is based on optimizing for the model class from which the data was generated.

4.2 Experiments on real-world datasets

Next, we compare the inductive learning capability of LR, LGC, SGC, and GCN on the U.S. election map data. In our transductive experiments, we used data from the 2016 election. We constructed an analogous dataset for 2012, where features and edges from the Facebook social connectedness index are identical, but the demographic features and election outcomes are different. Similar to the synthetic experiment, the learning algorithms are trained on of labels from the year 2012, and tested on of the nodes in the year . We choose four attributes (median household income, education level, unemployment rate and election outcome) as the label for prediction in different experiments, and the results are summarized in Fig. 6. We first observe that the inductive accuracies for income and education prediction are close to the corresponding transductive accuracies, partially because those statistics are relatively stable over time, so the models trained on 2012 data provide good estimates for 2016. When predicting attributes that changed substantially — such as unemployment rate and election outcome — the GCN generalizes worse than the linear methods LGC and GCN, as shown by the larger gap between its transductive and inductive performance. Overall, LGC achieves the highest inductive learning accuracies.

Figure 6: The inductive learning accuracies for the U.S. election map with different choices of label. The transductive learning accuracies are also included for comparison. Each reported accuracy is averaged over 10 runs with different random data splits. GCN performs well for predicting unemployment and election in the transductive setting, but does not generalize as well as the linear methods to an unseen graph. LGC overall gives the highest inductive accuracies.
year sh050m sh100m sh500m income migration birth death education unemployment
2012 0.06 -0.42 0.24 0.22 0.16 -0.13 0.04 -0.90 -0.38
2016 -0.02 -0.38 0.22 0.70 0.21 -0.13 0.51 -1.53 -0.39
Table 4: LGC regression coefficients for county level margin of victory in 2012 and 2016 presidential elections. The margin of victory can be either positive (republican won) or negative (democrat won). The features consist of 3 social features sh050m, sh100m, sh500m, representing the average shares of Facebook friends living within 50, 100, and 500 miles, as well as 6 demographic features for each county such as median household income, migration rate, etc.

Interpreting LGC coefficients.  Another advantage of LGC (and linear methods in general) over GCNs is interpretability, as we can inspect the regression coefficients. We average the LGC regression coefficients (i.e., in Eq. 25) over different runs for the election outcome prediction, trained on either 2012 or 2016 data, which are reported in Table 4. Here, the election outcomes are measured by the margin of victory in each county, which can be either positive (republican won) or negative (democrat won). There are social features (sh050m, sh100m, sh500m, representing the share of Facebook friends living within 50 miles, 100 miles and 500 miles) and several demographic features (income, migration rate, birth rate, death rate, education level, and unemployment rate). All features are normalized to have zero mean and unit standard deviation.

From the regression coefficients using the 2016 data, the features with the most positive and negative coefficients are income and education. This suggests that republican voting counties are more likely to have a higher median income and lower education level. When comparing the regression coefficients trained on 2016 vs. 2012 data, the features with the largest change of coefficients are income, education and death rate. Since a higher death rate in general indicate an older population, this suggests that republican voters in 2016 were in general richer, older, and less educated, compared to 2012. Finally, the social feature sh100m is associated with a large negative coefficient and sh500 a positive coefficient, meaning that republication voters in general might have more long-distance friends. This aligns with the fact that many rural and sparsely populated counties lean republican in recent U.S. elections.

Data-split LP LR LGC SGC GCN LGC/RP SGC/RP GCN/RP
transductive 0.70 0.62 0.61 0.59 0.79 0.71 0.73 0.83
inductive 0.37 0.31 0.30 0.46
Table 5: Transductive and inductive classification accuracy on the Elliptic Bitcoin dataset as measured by the F1 score on the illicit class. Residual propagation is helpful in the transductive setting, and the GCN performs well.

4.3 Extension to classification

Although all of the algorithms we have derived from our generative model are meant for regression problems, they can be adapted for classification by adapted for classification of discrete classes. In particular, for binary classification, one can simply choose a cut-off threshold for positive verses negative classes. As a proof-of-concept example, we consider the Elliptic Bitcoin dataset [51], where nodes represent transactions and edges correspond to payment flows. The outcome is a “licit” or “illicit” class label for each transaction. In this dataset, each node is associated with one of 49 time steps, where nodes from different time steps are disconnected. This allow us to compare the inductive and transductive learning performances with different data splitting methods. For the inductive setting, we use the same setup as Pareja et al. [51], splitting nodes by time order into training/validation/testing sets with 31/5/13 time steps; in this setting, label propagation and residual propagation are not useful because training nodes are disconnected with testing nodes. For the transductive setting, we use the same proportions of training, validation, testing data, but nodes are assigned to each uniformly at random; this setting allows us to demonstrate the effect of label homophily.

During training, we treat the prediction problem as a regression task, where we map the illicit class into a real-valued outcome and licit class to .4 For prediction, each node is first assigned a real-valued outcome estimation, then categorized into either one of the two classes depending on whether the outcome exceeds the cut-off threshold. We treat the cut-off threshold as an additional hyperparameter, and tune it together with the others on the validation set by maximizing the F1 score for the illicit class. Finally, we measure the classification accuracy by the F1 score for the illicit class on the testing data. We summarize the results in Table 5. First, all four algorithms (LR, LGC, SGC, and GCN) perform worse in the inductive setting than the transductive setting, as the correlation between features and labels changes over time. Second, GCN outperforms LGC and SGC by a large margin in both settings, so nonlinear relations between features and labels plays an important role. At last, while residual propagation does not help prediction in the inductive setting since the training and testing nodes are disconnected, it improves the classification accuracy of all the graph learning algorithms (LGC, SGC, and GCN) considerably in the transductive setting. This experiment shows that our residual propagation post-processing can potentially help transductive node classification tasks as well. Developing generative models similar to Gaussian MRF that directly models categorical attributes is an interesting avenue of future research.

5 Fitting attributed graph datasets to the generative model

So far we have focused on the algorithms derived from our Gaussian MRF model, where we used outcome homophily and feature correlation to qualitatively explain their performance differences on various datasets. Next, we directly fit an attributed graph to our generative model. The learned Gaussian MRF can be used to generate graphs that closely resemble the original input, or quantitatively estimate the prediction accuracies of LP, LGC, and LGC/RP.

5.1 Maximum likelihood estimation

Here, we assume that all of the vertex attribute values are given, and our goal is to find the Gaussian MRF model that best explains how those attributes are sampled. To this end, we use maximum likelihood estimation to fit the Gaussian MRF parameters. The objective function is the negative log-likelihood:

(45)
(46)

The derivatives with respect to the model parameters are then