# Empirical Risk Minimization and Stochastic Gradient Descent for Relational Data

Victor Veitch\*, Morgane Austern\*, Wenda Zhou\*, David M. Blei, Peter Orbanz

\* equal contribution

###### Abstract

Empirical risk minimization is the principal tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent advances in graph sampling theory. We (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this risk that are automatically unbiased. The key ingredient is to consider the method by which data is sampled from a graph as an explicit component of model design. Theoretical results establish that the choice of sampling scheme is critical. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we are able to solve the risk minimization in a plug-and-play fashion even on large datasets. We demonstrate empirically that relational ERM models achieve state-of-the-art results on semi-supervised node classification tasks. The experiments also confirm the importance of the choice of sampling scheme.

## 1 Introduction

Relational data is data that can be represented as a graph (e.g. the link graph of a social network, or user/movie ratings), possibly annotated with additional information (e.g. user profiles). We consider prediction problems for relational data. Many prediction methods used in machine learning are based on empirical risk minimization (ERM) Vapnik:1992; Vapnik:1995; Shalev-Shwartz:Ben-David:2014. However, ERM inherently assumes that data are i.i.d. This assumption is meaningless for relational data; hence, generalizing ERM to relational data remains an unsolved problem.

To address this problem, we draw on recent work in statistics and applied probability that emphasizes the role of sampling theory in modeling graph data Orbanz:2017; Veitch:Roy:2016; Borgs:Chayes:Cohn:Veitch:2017; Crane:Dempsey:2016:snm. In Section 2, we explain how risk and empirical risk can be defined for relational data. The definition is based on a specific choice of a randomized algorithm that samples data from a graph. Different sampling algorithms yield different notions of relational empirical risk, and we review several possible choices. For large datasets, the classical ERM problem is usually solved with stochastic gradient descent (SGD), since unbiased estimates of the empirical risk can be efficiently computed by uniformly subsampling the data. In Section 2.2, we show that unbiased stochastic gradients of the relational empirical risk can be efficiently computed by replacing uniform sampling with an appropriate surrogate. This results in an efficient plug-and-play SGD algorithm for solving the minimization problem. In Section 3, we detail two concrete examples, and we draw connections to the graph embeddings literature (e.g., Perozzi:Al-Rfou:Skiena:2014; Grover:Leskovec:2016; Tang:Qu:Wang:Zhang:Yan:Mei:2015; Hamilton:Ying:Leskovec:2017:review). We derive theoretical results in Section 4 that clarify the fundamental role of the choice of sampling algorithm. In Section 5, we study relational ERM empirically. We observe that (i) the choice of sampling scheme has a substantial effect in practice, (ii) novel models can be easily fit using SGD with relational ERM, and (iii) combining these two observations leads to state-of-the-art performance on semi-supervised vertex label prediction tasks. We provide fast implementations of a variety of graph sampling algorithms and integration with TensorFlow.

## 2 Empirical risk and subsampling

Here is the classical setup of ERM: A single observation is represented by a random variable $X$. Associated with $X$ is a label $Y$, which may or may not be observed. We denote the “complete” observation by $\bar{X} = (X, Y)$. A predictor is a function $\pi$ that maps observations to labels. For an element $x$ of the sample space, it is typically written as $\pi(x) = \hat{y}$, where $\hat{y}$ is a predicted label. Anticipating relational data, we instead write $\bar{x} = (x, y)$ and $\pi(x) = (x, \hat{y})$, that is, the predictor “completes” unlabeled information $x$ to labeled information $(x, \hat{y})$. How well $\pi(X)$ reconstructs $\bar{X}$ is measured by the value of a real-valued loss function $L(\pi(X), \bar{X})$.

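ERM typically assumes a sample $\bar{S}_n = (\bar{X}_1, \ldots, \bar{X}_n)$ of observations to be generated i.i.d. from some probability distribution $P$. The risk of $\pi$ is the expectation of the loss under this distribution,

$$R(\pi) := \mathbb{E}_{\bar{X} \sim P}\big[L(\pi(X), \bar{X})\big]. \tag{1}$$

The empirical distribution of the sample is $F(\bar{S}_n) = \frac{1}{n}\sum_{i \le n} \delta_{\bar{X}_i}$, where $\delta_x$ is either the Dirac function at $x$ (if the set in which $\bar{X}$ takes its values is uncountable), or the indicator (on a discrete space). The empirical risk of $\pi$ substitutes $F(\bar{S}_n)$ for $P$,

$$\hat{R}(\pi, \bar{S}_n) := \mathbb{E}_{\bar{X} \sim F(\bar{S}_n)}\big[L(\pi(X), \bar{X}) \mid \bar{S}_n\big] = \frac{1}{n} \sum_{i \le n} L(\pi(X_i), \bar{X}_i). \tag{2}$$

Since the observed sample $\bar{S}_n$ is random, $\hat{R}(\pi, \bar{S}_n)$ is random. To choose a predictor $\pi$, we posit a hypothesis class $\{\pi_\theta\}$ of predictors indexed by a parameter $\theta$ with values in a parameter space. Empirical risk minimization selects a predictor as

$$\hat{\pi} := \pi_{\hat{\theta}_n} \quad\text{where}\quad \hat{\theta}_n := \arg\min_{\theta} \hat{R}(\pi_\theta, \bar{S}_n).$$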

ERM is statistically sound in that converges to a value determined by . This property follows from the fact that are i.i.d. (Devroye:Gyorfi:Lugosi:1996).
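
As a minimal sketch of Eq. (2) and the ERM rule, the following Python snippet evaluates the empirical risk of each predictor in a small, finite hypothesis class and picks the minimizer. The squared loss, the linear predictors, the candidate grid, and all function names are illustrative assumptions rather than part of the setup above. The predictor is written in the typical form that returns a predicted label, not in the “completion” form.

```python
def empirical_risk(predict, loss, sample):
    """Average loss over a sample of (x, y) pairs, as in Eq. (2)."""
    return sum(loss(predict(x), y) for x, y in sample) / len(sample)


def erm(thetas, make_predictor, loss, sample):
    """Return the candidate parameter with the smallest empirical risk."""
    return min(thetas, key=lambda t: empirical_risk(make_predictor(t), loss, sample))


# Toy data and hypothesis class (assumptions chosen for illustration only).
sample = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def squared_loss(y_hat, y):
    return (y_hat - y) ** 2


def linear(theta):
    return lambda x: theta * x


theta_hat = erm([0.0, 1.0, 2.0, 3.0], linear, squared_loss, sample)
```

Here the data satisfy `y = 2 * x` exactly, so the candidate `2.0` attains zero empirical risk and is selected.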

### 2.1 Empirical risk for relational data

Classical ERM underpins many machine learning algorithms, but it tacitly assumes data are i.i.d. from some distribution. For relational data, the i.i.d. assumption is meaningless. We now consider how to define a useful notion of empirical risk for relational data.

We model relational data as graphs. Instead of the sample $\bar{S}_n$ above, we observe a graph $\bar{G}_n$ of size $n$ (for example, the number of vertices or edges), possibly annotated, e.g., by vertex labels. This graph is assumed to represent a small part of a large, underlying graph $\mathcal{G}$ (e.g., an entire social network). For relational data, the predictor $\pi$ is a function whose input is an incomplete version $G_n$ of $\bar{G}_n$, which it augments with an estimate of the missing information, such as vertex labels or missing edges.

To generalize the empirical risk, we first re-examine its definition above: Suppose the observations $\bar{X}_i$ are generated by randomly selecting examples from some large, underlying set $\mathcal{X}$. In statistics, $\mathcal{X}$ is called a population. We generate a sample as follows:

###### Algorithm 1 (Sampling with replacement).
i.) Select $n$ elements $\bar{X}_i$ of $\mathcal{X}$ independently and uniformly with replacement.

ii.) Report $\bar{S}_n = (\bar{X}_1, \ldots, \bar{X}_n)$.

Then $\bar{X}_1, \ldots, \bar{X}_n$ are i.i.d., with distribution determined by $\mathcal{X}$. If $\mathcal{X}$ is small, one has to distinguish carefully between sampling with and without replacement, but for large populations, the two become indistinguishable. In the so-called infinite population limit $|\mathcal{X}| \to \infty$, every distribution $P$ arises in this manner. Modeling data as i.i.d. can thus be justified as sampling from a very large population using Algorithm 1. The empirical risk (2) and risk (1) can thus be rewritten as

$$\hat{R}(\pi, \bar{S}_n) = \mathbb{E}_{\bar{S}_1 \sim \bar{S}_n}\big[L(\pi(S_1), \bar{S}_1)\big] \quad\text{and}\quad R(\pi) = \mathbb{E}_{\bar{S}_1 \sim \mathcal{X}}\big[L(\pi(S_1), \bar{S}_1)\big] = \mathbb{E}_{\bar{S}_n \sim \mathcal{X}}\big[\hat{R}(\pi, \bar{S}_n)\big].$$

See e.g. Politis:Romano:Wolf:1999 for a rigorous justification. Note that the empirical risk involves two sampling steps: a data acquisition step that generates an observed sample of data by drawing from the (unobserved) underlying population, and a draw from the empirical measure.

We use this perspective on classical ERM to define ERM for relational data. In this case, the population $\mathcal{X}$ is replaced by the population graph $\mathcal{G}$. Again, there are two distributions at play. In the data acquisition step, we replace Algorithm 1 by a suitable algorithm that samples a random subgraph $\bar{G}_n$ from an input graph $g$. An observed graph $\bar{G}_n$ is then modeled as such a sample from a large, unknown population graph $\mathcal{G}$. Specific choices for such algorithms are discussed in Section 2.3 below. As in the i.i.d. case above, an infinite-population limit can be made rigorous Orbanz:2017; Borgs:Chayes:Cohn:Veitch:2017.

The second sampling step, which corresponds to sampling from the empirical distribution in the i.i.d. case, uses a sampling algorithm $B$, which may or may not be identical to the data acquisition algorithm. We denote this second sampling step as a subsampling routine

$$\mathrm{Sample}(\bar{G}_n, k) := \bar{G}_k \sim B(\bar{G}_n).$$

The distinction in notation is made to emphasize that the data acquisition algorithm is a modeling assumption on how observed data has been acquired, whereas $\mathrm{Sample}$ is a definition, chosen by the analyst and implemented in code. In analogy to the sampling representation of the risk above, we define

$$\hat{R}_k(\pi, \bar{G}_n) := \mathbb{E}_{\bar{G}_k = \mathrm{Sample}(\bar{G}_n, k)}\big[L(\pi(G_k), \bar{G}_k) \mid \bar{G}_n\big] \quad\text{and}\quad R_k(\pi) = \mathbb{E}_{\bar{G}_n \sim \mathcal{G}}\big[\hat{R}_k(\pi, \bar{G}_n)\big]. \tag{3}$$

We call $R_k(\pi)$ the relational risk, and $\hat{R}_k(\pi, \bar{G}_n)$ the relational empirical risk. Relational empirical risk minimization selects a predictor as

$$\hat{\pi} := \pi_{\hat{\theta}_n} \quad\text{where}\quad \hat{\theta}_n := \arg\min_{\theta} \hat{R}_k(\pi_\theta, \bar{G}_n). \tag{4}$$

In summary, a relational ERM model is defined by three ingredients:

1. A class of predictors $\{\pi_\theta\}$ with parameter $\theta$.

2. A loss function $L$.

3. A sampling routine $\mathrm{Sample}$.

Given observed data $\bar{G}_n$, a model is fit by solving (4).
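
To make the three ingredients concrete, the sketch below computes a relational empirical risk exactly on a toy graph by enumerating every outcome of the sampling routine. It assumes uniform vertex sampling without replacement as the routine and uses the number of edges in the sampled subgraph as a stand-in loss; both choices, and all function names, are illustrative assumptions.

```python
from itertools import combinations


def induced_edges(edges, vertices):
    """Edges of the input graph with both endpoints in `vertices`."""
    keep = set(vertices)
    return [(u, v) for (u, v) in edges if u in keep and v in keep]


def exact_relational_risk(vertices, edges, k, loss):
    """Average `loss` over all k-subsets of vertices.

    Each k-subset is equally likely under uniform vertex sampling without
    replacement, so this average is the conditional expectation in Eq. (3).
    """
    subsets = list(combinations(vertices, k))
    return sum(loss(induced_edges(edges, s)) for s in subsets) / len(subsets)


# Path graph 0 - 1 - 2 - 3 with a toy loss that counts sampled edges.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]
risk = exact_relational_risk(vertices, edges, k=2, loss=len)
```

Of the six vertex pairs, three are edges, so `risk` is 0.5. Swapping in a different sampling routine changes which subgraphs are averaged, and therefore the value of the risk.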

### 2.2 Stochastic gradient descent

For relational ERM to be useful in practice, the minimization problem (4) must be (approximately) solvable in a reasonable amount of time. This is possible through stochastic gradient descent (SGD). SGD requires unbiased estimates of the gradient of the objective function, which in this case is the relational empirical risk. We define a stochastic gradient as $\nabla_\theta L(\mathrm{Sample}(G_n, k); \theta)$, the gradient of the loss computed on a sample of size $k$ drawn with $\mathrm{Sample}$. The key observation is

$$\nabla_\theta \hat{R}_k(\theta; G_n) = \nabla_\theta\, \mathbb{E}\big[L(\mathrm{Sample}(G_n, k); \theta) \mid G_n\big] = \mathbb{E}\big[\nabla_\theta L(\mathrm{Sample}(G_n, k); \theta) \mid G_n\big].$$

That is, the random gradient is an unbiased estimator of the gradient of the full relational empirical risk. If $\mathrm{Sample}$ is computationally efficient, then relational ERM can be solved using SGD with this stochastic estimator. Combined with automatic differentiation, the resulting algorithm is an effective and generically applicable solver for the minimization problem.
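
The following sketch illustrates SGD driven by subgraph sampling on a toy problem. It is not the implementation described in this paper: the scalar parameter, the quadratic loss on edge weights, the use of uniform edge sampling as the sampling routine, the hand-written gradient (in place of automatic differentiation), and the step-size schedule are all assumptions made to keep the example self-contained.

```python
import random


def sample_edges(edges, k, rng):
    """Stand-in for Sample(G_n, k): k weighted edges drawn uniformly with replacement."""
    return [rng.choice(edges) for _ in range(k)]


def loss_gradient(subgraph, theta):
    """Gradient in theta of the toy loss (1/k) * sum_e (theta - w_e)^2."""
    return sum(2.0 * (theta - w) for (_, _, w) in subgraph) / len(subgraph)


def relational_sgd(edges, k, steps, seed=0):
    """Run SGD where each step uses the gradient on one freshly sampled subgraph."""
    rng = random.Random(seed)
    theta = 0.0
    for t in range(1, steps + 1):
        grad = loss_gradient(sample_edges(edges, k, rng), theta)
        theta -= grad / (2.0 * t)  # decaying step size 1 / (2t)
    return theta


# Toy weighted graph: (u, v, weight). The risk minimizer is the mean weight, 3.0.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 3.0), (3, 0, 6.0)]
theta_hat = relational_sgd(edges, k=2, steps=5000)
```

Because each sampled gradient is unbiased for the gradient of the expected loss, the iterate settles near the minimizer of the toy relational empirical risk, which here is the average edge weight.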

### 2.3 Subsampling algorithms

In classical ERM, sampling uniformly (with or without replacement) is typically the only choice. In contrast, there are many ways to sample from a graph. Each such sampling algorithm leads to a different notion of risk and empirical risk in (3). This section describes some possibilities.

Perhaps the closest analogue of Algorithm 1 for graphs is uniform selection of vertices:

###### Algorithm 2 (Uniform vertex sampling).
i.) Select $n$ vertices of $g$ independently and uniformly without replacement.

ii.) Extract the induced subgraph $\bar{G}_n$ of $g$ on these vertices.

iii.) Label the vertices of $\bar{G}_n$ by $1, \ldots, n$ in order of appearance.
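
A minimal Python sketch of Algorithm 2 follows. The edge-list representation of the graph and the function and parameter names are assumptions made for illustration.

```python
import random


def uniform_vertex_sample(vertices, edges, num_vertices, rng):
    """Sketch of Algorithm 2: sample vertices uniformly and return the induced subgraph."""
    chosen = rng.sample(list(vertices), num_vertices)  # uniform, without replacement
    label = {v: i + 1 for i, v in enumerate(chosen)}  # relabel by order of appearance
    sub_edges = [(label[u], label[v]) for (u, v) in edges if u in label and v in label]
    return sorted(label.values()), sub_edges


# Complete graph on 5 vertices: every 3-vertex sample induces a triangle.
vertices = range(5)
edges = [(u, v) for u in vertices for v in vertices if u < v]
sub_vertices, sub_edges = uniform_vertex_sample(vertices, edges, 3, random.Random(0))
```

On a sparse graph the same routine would usually return few or no edges, which is the weakness discussed next.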

The input graph may either represent a population , or a previously extracted sample . Algorithm 2 is simple, but often problematic, since it is not suitable for sparse graphs. There are various definitions of sparsity, but they all have in common that the fraction of edges present in the graph approaches zero as the graph grows. The expected number of edges reported by Algorithm 2 is , which vanishes for large graphs. Algorithms for sparse data must mitigate this problem. The next algorithm retains only non-empty portions of the graph:

###### Algorithm 3 ($p$-sampling Veitch:Roy:2016).
i.) Select each vertex in $g$ independently, with a fixed probability $p \in [0,1]$.

ii.) Extract the induced subgraph $\bar{G}_n$ of $g$ on the selected vertices.

iii.) Delete all isolated vertices from $\bar{G}_n$, and report the resulting graph.

The main difference from Algorithm 2 is the deletion step (iii).
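
The three steps of Algorithm 3 can be sketched in Python as follows. The edge-list representation and the function name are illustrative assumptions.

```python
import random


def p_sample(vertices, edges, p, rng):
    """Sketch of Algorithm 3: keep each vertex with probability p, drop isolated vertices."""
    kept = {v for v in vertices if rng.random() < p}  # independent coin flips
    sub_edges = [(u, v) for (u, v) in edges if u in kept and v in kept]
    sub_vertices = sorted({x for e in sub_edges for x in e})  # non-isolated vertices only
    return sub_vertices, sub_edges


# Vertex 3 has no edges, so it never appears in the output.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2)]
everything = p_sample(vertices, edges, 1.0, random.Random(0))
nothing = p_sample(vertices, edges, 0.0, random.Random(0))
```

With `p = 1.0` the whole graph is returned except for the isolated vertex, and with `p = 0.0` the output is empty.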

###### Algorithm 4 (Uniform edge sampling).
i.) Select $n$ edges in $g$ uniformly and independently from the edge set.

ii.) Report the graph $\bar{G}_n$ consisting of these edges, and all vertices incident to these edges.
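
A short Python sketch of Algorithm 4 is given below. It reads “uniformly and independently” as sampling with replacement, so the same edge may be drawn more than once; that reading, the edge-list representation, and the function name are assumptions.

```python
import random


def uniform_edge_sample(edges, num_edges, rng):
    """Sketch of Algorithm 4: draw edges uniformly and independently (with replacement)."""
    sampled = [rng.choice(edges) for _ in range(num_edges)]
    sub_vertices = sorted({x for e in sampled for x in e})  # endpoints of sampled edges
    return sub_vertices, sampled


edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
sub_vertices, sub_edges = uniform_edge_sample(edges, 3, random.Random(0))
```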

Recall that a simple random walk of length $k$ on a graph $g$ selects vertices $v_1, \ldots, v_k$ by starting at a given vertex $v_1$, and drawing each vertex $v_{i+1}$ uniformly from the neighbors of $v_i$. Typically, a random walk sample is augmented with additional edges, either as part of the data collection procedure, or to increase the efficiency of stochastic gradient descent. We consider two strategies. The first fills in the entire induced subgraph:

###### Algorithm 5 (Random walk: Induced).
i.) Sample a random walk $v_1, \ldots, v_k$ starting at a uniformly selected vertex of $g$.

ii.) Report $\bar{G}_n$ as the edge list of the vertex-induced subgraph of the walk.
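
The sketch below illustrates Algorithm 5 in Python. The adjacency-list representation, the assumption that every vertex has at least one neighbor, and the function names are choices made for this example.

```python
import random


def simple_random_walk(adj, length, rng):
    """Walk of `length` vertices from a uniformly chosen start vertex."""
    walk = [rng.choice(sorted(adj))]
    while len(walk) < length:
        walk.append(rng.choice(adj[walk[-1]]))  # uniform over neighbors
    return walk


def random_walk_induced(adj, length, rng):
    """Sketch of Algorithm 5: edge list of the subgraph induced by the walk's vertices."""
    walk = simple_random_walk(adj, length, rng)
    visited = set(walk)
    sub_edges = sorted(
        {(min(u, v), max(u, v)) for u in visited for v in adj[u] if v in visited}
    )
    return walk, sub_edges


# Adjacency lists of a 5-cycle (every vertex has two neighbors).
adj = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
walk, sub_edges = random_walk_induced(adj, 4, random.Random(0))
```

Every edge traversed by the walk appears in the output, along with any other edge of the input graph whose endpoints were both visited.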

The second strategy augments the walk by hallucinating plausible additional edges.

###### Algorithm 6 (Random walk: Skipgram Perozzi:Al-Rfou:Skiena:2014).
i.) Sample a random walk $v_1, \ldots, v_k$ starting at a uniformly selected vertex of $g$.

ii.) Report $\bar{G}_n = \{(v_i, v_j) : d(v_i, v_j)$ …

This algorithm relates to the Skipgram model Mikolov:Chen:Corrado:Dean:2013, and is commonly used in graph embeddings, e.g. DeepWalk Perozzi:Al-Rfou:Skiena:2014 and its successors.
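
As a rough illustration of Skipgram-style pairing, the sketch below reports pairs of walk vertices that lie close together along the walk. The specific rule used here (pair two vertices when they are at most `window` steps apart along the walk), the `window` parameter, and the function name are assumptions in the spirit of Skipgram models, not a statement of the exact criterion used in Algorithm 6.

```python
def skipgram_pairs(walk, window):
    """Pairs of walk vertices at most `window` steps apart along the walk (assumed rule)."""
    pairs = []
    for i, u in enumerate(walk):
        for v in walk[i + 1 : i + 1 + window]:
            pairs.append((u, v))
    return pairs


pairs = skipgram_pairs(["a", "b", "c", "d"], window=2)
```

Pairs produced this way need not be edges of the input graph, which is how the sampler adds plausible edges beyond those traversed by the walk.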

###### Remark 2.1.

These algorithms presuppose graph data, but all can be generalized to “bipartite” relational data, where the rows and columns of an input matrix represent different entities (for example, users and movies). In this case, Algorithm 2 would select rows and columns, Algorithm 3 would use two parameters and , Algorithm 4 would still select entries of the matrix and report each along with a row and column identifier. Random walks can also be transcribed to this setting, although there is not a unique way of doing so.

### 2.4 Negative sampling

For a pair of vertices in an input graph $g$, a sampling algorithm can report three types of edge information: the edge may be observed as present, observed as absent (a non-edge), or may not be observed. Aside from Algorithm 2, the algorithms above do not treat edge and non-edge information equally: Algorithms 4, 5 and 6 cannot report non-edges, and the deletion step in Algorithm 3 biases it towards edges over non-edges. However, the locations of non-edges can carry significant information.

Negative sampling schemes are “add-on” algorithms that are applied to the output of a graph sampling algorithm and augment it with non-edge information. Let $\bar{G}_n$ denote a sample generated by one of the algorithms above from an input graph $g$.

###### Algorithm A (Negative sampling: Induced subgraph).
i.) Report the subgraph induced by $\bar{G}_n$ in the input graph $g$ from which $\bar{G}_n$ was drawn.
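
Algorithm A can be sketched in Python as below: given the vertices of a sample and the edges of the input graph, every pair of sampled vertices is classified as an edge or a non-edge. The undirected edge-list representation and the function name are assumptions.

```python
from itertools import combinations


def induced_negative_sampling(sample_vertices, graph_edges):
    """Sketch of Algorithm A: edges and non-edges among the sampled vertices."""
    edge_set = {frozenset(e) for e in graph_edges}
    positives, negatives = [], []
    for u, v in combinations(sorted(sample_vertices), 2):
        if frozenset((u, v)) in edge_set:
            positives.append((u, v))
        else:
            negatives.append((u, v))  # observed as absent in the input graph
    return positives, negatives


# Path graph 0 - 1 - 2 - 3; the sample covers vertices 0, 1 and 2.
graph_edges = [(0, 1), (1, 2), (2, 3)]
positives, negatives = induced_negative_sampling({0, 1, 2}, graph_edges)
```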

Another method Mikolov:Sutskever:Chen:Corrado:2013; Goldberg:Levy:2014 is based on the unigram distribution: Define a probability distribution on the vertex set of by , the probability that would occur in a separate, independent sample generated from by the same algorithm as . For , we define a distribution , where is the appropriate normalization.

###### Algorithm B (Negative sampling: Unigram).

For each vertex in :
i.) Select vertices independently. ii.) If is a non-edge in , add it to .

Algorithm B is common in the embeddings literature, where the canonical choice is , see Mikolov:Sutskever:Chen:Corrado:2013.
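
The sketch below shows one way unigram negative sampling could be implemented in Python. Several details are assumptions made for this example: vertex occurrence counts stand in for the unigram probabilities, the smoothing is implemented by raising counts to a power, the value 0.75 is the exponent commonly used with word2vec-style negative sampling, and all function and parameter names are hypothetical.

```python
import random


def unigram_negative_sample(sample_vertices, graph_edges, counts, num_neg, exponent, rng):
    """Sketch of Algorithm B: pair each sampled vertex with vertices drawn from a
    smoothed unigram distribution, keeping only pairs that are non-edges."""
    candidates = sorted(counts)
    weights = [counts[v] ** exponent for v in candidates]  # unnormalized probabilities
    edge_set = {frozenset(e) for e in graph_edges}
    negatives = []
    for u in sample_vertices:
        for v in rng.choices(candidates, weights=weights, k=num_neg):
            if v != u and frozenset((u, v)) not in edge_set:
                negatives.append((u, v))
    return negatives


# `counts` is a hypothetical proxy for how often each vertex appears in samples.
graph_edges = [(0, 1), (1, 2), (2, 3)]
counts = {0: 1, 1: 2, 2: 2, 3: 1}
negatives = unigram_negative_sample([0, 1], graph_edges, counts, 5, 0.75, random.Random(0))
```

Draws that coincide with the vertex itself or with an existing edge are discarded, so the number of reported non-edges per vertex can be smaller than `num_neg`.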

## 3 Example Models

We now consider two concrete examples of relational ERM, and also discuss relations to graph embedding methods. For the examples, we specify a class of predictors and a loss function . Each can then be used with different choices of . In both examples, we observe a natural split of the parameter into a pair : global parameters shared across the entire graph, and embedding parameters , where each vertex has an associated embedding . Informally, global parameters encode population properties—“people with different political affiliation are less likely to be friends”—and the embeddings encode vertex-specific information—“Bob is a radical vegan.”

##### Semi-supervised node classification

Consider a network where each node is labeled by some binary features—for example, hyperlinked documents labeled by subjects, or interacting proteins labeled by function. The task is to predict the features of a subset of these nodes given the graph structure and the features of the remaining nodes.

The model has the following form: Each vertex is assigned a -dimensional embedding vector . Feature prediction is made by a parameterized function that maps the embedding of each node to the probability that the node has each of the possible features. Let denote the sigmoid function; let denote whether vertex has feature ; and let . The loss on subgraphs is:

 (5)

Here, , , and denote the vertices, edges, and non-edges of the graph respectively. The loss on edge terms is cross-entropy, a standard choice in embedding models Hamilton:Ying:Leskovec:2017:review. Intuitively, the predictor uses the embeddings to predict both the vertex labels and the subgraph structure.

The model is completed by choosing a sampling scheme . Relational ERM then fits the parameters as

$$(\hat{\lambda}_n, \hat{\gamma}_n) = \arg\min_{\lambda, \gamma} \mathbb{E}\big[L(\mathrm{Sample}(G_n, k), l; \lambda, \gamma) \mid G_n\big].$$

In Section 5 below, we revisit this model and report prediction performance for different choices of .

##### Wikipedia category embeddings

We consider Wikipedia articles joined by hyperlinks. Each article is tagged as a member of one or more categories—for example, “Muscles_of_the_head_and_neck”, “Japanese_rock_music_groups”, or “People_from_Worcester.” The task is to learn latent structure encoding semantic relationships between the categories.

Let denote the hyperlink graph and let denote the categories of article . Each category is assigned an embedding , and the embedding of each article (vertex) is taken to be the sum of the embeddings of its categories, . The loss is

$$L(G_k, C; \lambda) = -\sum_{i,j \in e(G_k)} \log \sigma(\lambda_j^T \lambda_i) - \sum_{i,j \in \bar{e}(G_k)} \log\big(1 - \sigma(\lambda_j^T \lambda_i)\big), \tag{6}$$

where $e(G_k)$ and $\bar{e}(G_k)$ denote, respectively, the presence and absence of hyperlinks between articles. Intuitively, the predictor uses the category embeddings to predict the hyperlink structure of subgraphs. Relational ERM chooses the embeddings as

$$\hat{\gamma}_n = \arg\min_{\gamma} \mathbb{E}\big[L(\mathrm{Sample}(G_n, k), C; \lambda(\gamma)) \mid G_n\big].$$

We write $\lambda(\gamma)$ to emphasize that the article embeddings are a function of the category embeddings. Category embeddings obtained with this model are illustrated in Fig. 1; see Section 5 for details on the experiment.
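
The loss (6) can be evaluated directly on a sampled subgraph, as in the following sketch. The category names, the two-dimensional embeddings, the dictionary-based data layout, and the function names are hypothetical; the softplus rewriting is a standard numerically stable way to compute $-\log \sigma(x)$ and $-\log(1 - \sigma(x))$.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def article_embedding(categories, gamma):
    """Sum the embeddings of an article's categories."""
    dim = len(next(iter(gamma.values())))
    return [sum(gamma[c][d] for c in categories) for d in range(dim)]


def link_loss(edges, non_edges, lam):
    """Cross-entropy loss of Eq. (6) for article embeddings `lam`."""
    def dot(i, j):
        return sum(a * b for a, b in zip(lam[i], lam[j]))

    positive = sum(softplus(-dot(i, j)) for i, j in edges)     # -log sigma(x)
    negative = sum(softplus(dot(i, j)) for i, j in non_edges)  # -log(1 - sigma(x))
    return positive + negative


# Hypothetical categories and two-dimensional category embeddings.
gamma = {"music": [1.0, 0.0], "japan": [0.0, 1.0], "anatomy": [-1.0, -1.0]}
article_categories = {0: ["music", "japan"], 1: ["music"], 2: ["anatomy"]}
lam = {a: article_embedding(cs, gamma) for a, cs in article_categories.items()}
loss = link_loss(edges=[(0, 1)], non_edges=[(0, 2), (1, 2)], lam=lam)
```

In a full implementation, the gradient of this loss with respect to the category embeddings would be obtained by automatic differentiation and used as the stochastic gradient.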

##### Graph representation learning

Methods for learning embeddings of vertices are widely studied; see Hamilton:Ying:Leskovec:2017:review for a recent review. The prototypical graph representation learning algorithm is DeepWalk Perozzi:Al-Rfou:Skiena:2014. The basic approach is to draw a large collection of simple random walks, view each of these walks as a “sentence” where each vertex is a “word”, and learn vertex embeddings by applying a standard word embedding method Mikolov:Chen:Corrado:Dean:2013; Mikolov:Sutskever:Chen:Corrado:2013. This idea has been extended in a number of directions, including using more complicated random walks Grover:Leskovec:2016, incorporating covariate information Hamilton:Ying:Leskovec:2017:inductive, and developing GANs for graphs Bojchevski:Shchur:Zugner:Gunnemann:2018. These approaches are state of the art for many tasks.

Graph representation learning algorithms may be viewed as particular cases of relational ERM. For example, informally, DeepWalk is equivalent to a relational ERM model that (i) predicts graph structure using a predictor parameterized only by embedding vectors, (ii) uses a cross entropy loss on graph structure, and (iii) samples by the random-walk skipgram sampler with unigram negative sampling. (DeepWalk uses hierarchical softmax instead of negative sampling, so the correspondence is not literal.) The relational ERM perspective allows us to move beyond random walk based sampling. It also permits modifications such as including covariate information or training embeddings and labels simultaneously.

## 4 Theory

Relational ERM involves two sampling procedures: the algorithm used in the data acquisition step, which generates data from an underlying population, and the algorithm $\mathrm{Sample}$ used to define ERM and to execute SGD. Since $\mathrm{Sample}$ is a component of model design, different choices of $\mathrm{Sample}$ should lead to different results, even in the infinite-data limit. The results below show that this is the case.

To phrase a theoretical result, we must choose specific algorithms. For data acquisition, we consider a population graph with edges. We assume that is “very large,” in the sense that . In other words, any effect that weakens as the graph grows is assumed to be negligible. We assume that an observed sample of size is generated by -sampling from , with , for . The distribution of in the “infinite population” case is well-defined Borgs:Chayes:Cohn:Veitch:2017.

Under this modeling assumption on , we compare two choices of : One is -sampling with . The other is sampling using a simple random walk (Algorithm 5), with walk length . We denote the empirical risk Algorithm 3 defines by , and of Algorithm 5 by .

The first result establishes that the relational empirical risk is a sensible objective function, and that the trained model depends on even in the infinite-data limit.

###### Theorem 4.1.

Suppose that is collected by -sampling as described above. Further suppose technical conditions given in the appendix. Then, for parameter setting satisfying the technical conditions, there are constants such that

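$$\hat{R}^{ps}_k(\bar{\theta}; \bar{G}_n) \to c^{ps}_{\bar{\theta}}, \qquad \hat{R}^{rw}_k(\bar{\theta}; \bar{G}_n) \to c^{rw}_{\bar{\theta}} \tag{7}$$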

both in probability and in as . Moreover, there are constants such that

$$\min_{\theta} \hat{R}^{ps}_k(\theta; \bar{G}_n) \to c^{ps}_{*}, \qquad \min_{\theta} \hat{R}^{rw}_k(\theta; \bar{G}_n) \to c^{rw}_{*}, \tag{8}$$

both in probability and in , as .

More details regarding the constants in (7) are given in the appendix; the relevant property is that $c^{ps}_{\bar{\theta}}$ and $c^{rw}_{\bar{\theta}}$ are generally distinct. Thus, (7) states that the limits of the $p$-sampling empirical risk and the random-walk empirical risk generally do not agree, because the learning procedure depends on the choice of $\mathrm{Sample}$. The same observation holds, as (8) shows, for the minimal empirical risk determined by learning.

The next result strengthens the convergence guarantee. It shows that the estimated global parameters converge in an absolute sense as more data is collected.

###### Theorem 4.2.

Suppose the conditions of Theorem 4.1, and also that the loss function verifies a certain strict convexity property in , given explicitly in the appendix. Let , and similarly for . Then and in probability for some constants and .

Again, in general, and the choice of hence affects learned parameters even in the infinite data limit. In general, need not coincide with the global parameter estimate learned by simultaneously minimizing the global parameters and embeddings. However, because the parameter values learned by simultaneous minimization need not be identifiable, it seems necessary to consider the two stage procedure to establish a simple convergence result.

## 5 Experiments

We empirically study the example models in Section 3, defined by (5) and (6) respectively. The models are determined by (5) and (6) up to the choice of $\mathrm{Sample}$. The experiments (i) consider the influence of the choice of sampling scheme; (ii) illustrate the applicability of relational ERM to new tasks; and (iii) evaluate performance.

##### Node classification problems

We begin with the semi-supervised node classification task
described in Section 3, using the model (5) with different choices of . We study the blog catalog and protein-protein interaction data reported in Grover:Leskovec:2016, summarized by the table on the right. Each vertex in the graph has one or more labels, and have their labels censored at training time. The task is to predict these labels at test time.

Two-stage training. We first train the model (5) using no label information to learn the embeddings (that is, with ). We then fit a logistic regression to predict vertex features from the trained embeddings. This two stage approach is a standard testing procedure in the graph embedding literature, e.g. Perozzi:Al-Rfou:Skiena:2014; Grover:Leskovec:2016. We use the same scoring procedure as Node2Vec Grover:Leskovec:2016 and, where applicable, the same hyperparameters. We preprocess the data to remove self-edges, and restrict each network to the largest connected component. The table on the right shows the effect of varying the sampling scheme used to train the embeddings. SGD succeeds in solving the relational ERM problem for all sampling schemes. As expected, we observe that the choice of sampling scheme affects the embeddings produced via the learning procedure, and thus also the outcome of the experiment. We further observe that sampling non-edges by unigram negative sampling gives better predictive performance relative to selecting non-edges from the vertex induced subgraph.

Simultaneous training. Next, we fit the model of Section 3 with —training the embeddings and global variables simultaneously. We choose label predictor as logistic regression, and adapt the loss to measure the loss only on vertices in the positive sample. We report average macro F1 scores, computed using the node2vec scoring procedure:

| ERM defined by | Blog catalog: Unif. | Blog catalog: rw | Blog catalog: p-samp | Protein-Protein: Unif. | Protein-Protein: rw | Protein-Protein: p-samp |
| --- | --- | --- | --- | --- | --- | --- |
| p-samp+ns (Alg. 3+B) | 0.30 | 0.34 | 0.35 | 0.30 | 0.37 | 0.39 |
| rw/skipgram+ns (Alg. 6+B) | 0.20 | 0.26 | 0.27 | 0.25 | 0.32 | 0.34 |
| Node2Vec (reported) | 0.26 | - | - | 0.18 | - | - |

Columns are labeled by the sampling scheme used to draw test vertices. We observe:

• Learning embeddings and logistic regression simultaneously improves performance.

• When training jointly, $p$-sampling outperforms the standard rw/skipgram procedure.

• Labels of nodes selected by random walk or $p$-sampling are easier to predict than those chosen uniformly at random.

Note that the average computed with uniform vertex sampling is the standard scoring procedure used in the previous table.

##### Wikipedia Category Embeddings

Finally, we illustrate an application of relational ERM to a non-standard task. We consider the task of discovering semantic relations between Wikipedia categories, as described in Section 3. We define a relational ERM model by choosing the cost function in (6), and $\mathrm{Sample}$ as Algorithm 6+B, the skipgram random walk sampler with unigram negative sampling. The data is the Wikipedia hyperlink network from Klymko:Gleich:Kolda:2014, consisting of Wikipedia articles from 2011-09-01 restricted to articles in categories containing at least 100 articles. The dataset is relatively large, with about 1.8M nodes and 28M edges. We choose embedding dimension . SGD converges in about 90 minutes on a desktop computer equipped with an Nvidia Titan Xp GPU. Fig. 1 visualizes example trained embeddings.

## 6 Conclusion

Relational ERM is a generalization of ERM from i.i.d. data to relational data. The key ingredients are explicitly accounting for the sampling scheme by which the data is collected (the analogue of the i.i.d. assumption) and including a data subsampling scheme as a component of model design (the analogue of the empirical distribution). Relational ERM models are defined by a loss function, a predictor class, and a sampling scheme. These models can be fit automatically using SGD. Accordingly, relational ERM provides an easy method to specify and fit relational data models, as illustrated in Sections 3 and 5.

The results presented here suggest a number of directions for future inquiry. Foremost: what is the relational analogue of statistical learning theory? The theory derived in the present paper establishes initial results. A more complete treatment may provide statistical guidelines for model development. Our results hinge critically on the assumption that the data is collected by $p$-sampling; it is natural to ask how other data-generating mechanisms can be accommodated. Similarly, it is natural to ask for guidelines for the choice of $\mathrm{Sample}$.

#### Acknowledgments

VV and PO were supported in part by grant FA9550-15-1-0074 of AFOSR. DB is supported by ONR N00014-15-1-2209, ONR 133691-5102004, NIH 5100481-5500001084, NSF CCF-1740833, the Alfred P. Sloan Foundation, the John Simon Guggenheim Foundation, Facebook, Amazon, and IBM. The Titan Xp used for this research was donated by the NVIDIA Corporation.

## Appendix A Overview of Proofs

The appendix is devoted to proving the theoretical results of the paper. These results are obtained subject to the assumption that the data is collected by $p$-sampling. This assumption is natural in the sense that it provides a reasonable middle ground between a realistic data collection assumption ($p$-sampling can result in complex models capturing many important graph phenomena Caron:Fox:2017; Veitch:Roy:2015; Borgs:Chayes:Cohn:Holden:2016) and mathematical tractability (we are able to establish precise guarantees).

The appendix is organized as follows. We begin by recalling the connection between $p$-sampling and graphex processes in Section B.1; this affords a useful explicit representation of the data generating process. In Section B.2, we recall the method of exchangeable pairs, a technical tool required for our convergence proofs. Next, in Section B.3, we collect the necessary notation and definitions. Empirical risk convergence results for $p$-sampling are then proved in Appendix C and results for the random walk in Appendix D. Finally, convergence results for the global parameters are established in Appendix E.

## Appendix B Preliminaries

### B.1 Graphex processes

Recall the setup for the theoretical results: we consider a very large population network with edges, and we study the graph-valued stochastic process given by taking each to be an -sample from and requiring these samples to cohere in the obvious way. We idealize the population size as infinite by taking the limit . The limiting stochastic process is well defined, and is called a graphex process Borgs:Chayes:Cohn:Veitch:2017.

Graphex processes have a convenient explicit representation in terms of (generalized) graphons Veitch:Roy:2015; Borgs:Chayes:Cohn:Holden:2016; Caron:Fox:2017.

###### Definition B.1.

A graphon is an integrable function $W : \mathbb{R}_+^2 \to [0,1]$.

###### Remark B.2.

This notion of graphon is somewhat more restricted than graphons (or graphexes) considered in full generality, but it suffices for our purposes and avoids some technical details.

We now describe the generative model for a graphex process with graphon . Informally, a graph is generated by (i) sampling a collection of vertices each with latent features , and (ii) randomly connecting each pair of vertices with probability dependent on the latent features. Let

be a Poisson (point) process on with intensity , where is the Lebesgue measure. Each atom of the point process is a candidate vertex of the sampled graph; the are interpreted as (real-valued) labels of the vertices, and the as latent features that explain the graph structure. Each pair of points with is then connected independently according to

$$\mathbb{1}[(\eta_i, \eta_j) \text{ connected}] \overset{\text{ind}}{\sim} \mathrm{Bern}(W(x_i, x_j)).$$

This procedure generates an infinite graph. To produce a finite sample of size , we restrict to the collection of edges . That is, we report the subgraph induced by restricting to vertices with label less than , and removing all vertices that do not connect to any edges in the subgraph. This last step is critical; in general there are an infinite number of points of the Poisson process such that , but only a finite number of them will connect to any edge in the induced subgraph.

Modeling as collected by -sampling is essentially equivalent to positing that is the graph structure of generated by some graphon . Strictly speaking, the -sampling model induces a slightly more general generative model that allows for both isolated edges that never interact with the main graph structure, and for infinite star structures; see Borgs:Chayes:Cohn:Veitch:2017. Throughout the appendix, we ignore this complication and assume that the dataset graph is generated by some graphon. It is straightforward but notationally cumbersome to extend this assumption to -sampling in full generality.
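
The generative procedure can be simulated approximately as in the sketch below. Several choices are assumptions made for this example and are not taken from the construction above: the Poisson process is taken to have unit rate, the latent features are truncated to a bounded interval so that the number of candidate points is finite, the example graphon is hypothetical, and all function and parameter names are illustrative.

```python
import random


def poisson_count(mean, rng):
    """Poisson draw via unit-rate exponential arrival times in [0, mean)."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_graphex(W, size, latent_max, rng):
    """Sketch of the generative model, with latent features truncated to [0, latent_max]."""
    # Unit-rate Poisson process on [0, size] x [0, latent_max]: (label, latent feature).
    n_points = poisson_count(size * latent_max, rng)
    points = [(rng.uniform(0, size), rng.uniform(0, latent_max)) for _ in range(n_points)]
    edges = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            if rng.random() < W(points[i][1], points[j][1]):  # Bernoulli(W(x_i, x_j))
                edges.append((i, j))
    vertices = sorted({v for e in edges for v in e})  # drop candidates with no edges
    return vertices, edges, n_points


def W_example(x, y):
    """Hypothetical graphon: connection probability decays with both latent features."""
    return 1.0 / ((1.0 + x) ** 2 * (1.0 + y) ** 2)


vertices, edges, n_points = sample_graphex(
    W_example, size=5.0, latent_max=10.0, rng=random.Random(0)
)
```

Most candidate points end up with no edges and are discarded, which mirrors the role of the deletion step in the construction.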

### B.2 Technical Background: Exchangeable Pairs

We will need to bound the deviation of the (normalized) degree of a vertex from its expectation. To that end, we briefly recall the method of exchangeable pairs; see Chaterjee:2005 for details.

###### Definition B.3.

A pair of real random variables is said to be exchangeable if

$$(X, X') \overset{d}{=} (X', X).$$

Let and be measurable function such that:

$$\mathbb{E}\big(F(X, X') \mid X\big) \overset{\text{a.s.}}{=} f(X), \quad\text{and}\quad F(X, X') = -F(X', X).$$

Let

$$v(X) \triangleq \frac{1}{2}\,\mathbb{E}\Big(\big(f(X)-f(X')\big)F(X,X') \,\Big|\, X\Big),$$

and suppose that $|v(X)| \le C$ almost surely for some constant $C$. Then

$$\forall x>0, \quad \mathbb{P}\big(|f(X)-\mathbb{E}(f(X))| \ge x\big) \le 2e^{-\frac{x^2}{2C}}.$$

Further, for all and it holds that:

$$\mathbb{P}\big(|f(X)-\mathbb{E}(f(X))| > x\big) \le \frac{(2p-1)^p\,\|v(X)\|_p^p}{x^p}.$$
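As a sanity check on the exponential bound, consider a sum $S = \sum_{i \le n} X_i$ of independent Bernoulli($q$) variables, and let $X'$ be obtained from $X = (X_1,\dots,X_n)$ by resampling one uniformly chosen coordinate. Taking $F(X,X') = n(S - S')$ gives $f(X) = S - nq$, and a direct computation gives $v(X) = \frac{1}{2}\sum_{i}\big[X_i(1-q) + (1-X_i)q\big] \le n/2$, so the bound applies with $C = n/2$. This example is ours, not the paper's, and it assumes the statement extends to vector-valued $X$, as in Chatterjee's general formulation. The sketch below compares a Monte Carlo estimate of the tail probability with the bound; the function names are hypothetical.

```python
import numpy as np

def exchangeable_pair_bound(x, c):
    """Right-hand side 2 * exp(-x^2 / (2C)) of the exponential bound."""
    return 2.0 * np.exp(-x ** 2 / (2.0 * c))

def empirical_tail(n, q, x, n_trials, rng):
    """Monte Carlo estimate of P(|S - nq| >= x) for S ~ Binomial(n, q)."""
    s = rng.binomial(n, q, size=n_trials)
    return float(np.mean(np.abs(s - n * q) >= x))

rng = np.random.default_rng(0)
n, q, x = 100, 0.3, 15.0
bound = exchangeable_pair_bound(x, c=n / 2.0)  # C = n / 2 from the computation above
estimate = empirical_tail(n, q, x, n_trials=100_000, rng=rng)
print(f"empirical tail {estimate:.4f}, bound {bound:.4f}")
```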

### B.3 Notation

For convenient reference, we include a glossary of important notation.

First, notation to refer to important graph properties:

• is the latent Poisson process that defines the graphex process in Section B.1. The labels are and the latent variables are .

• is the restriction of the Poisson process to atoms with labels in .

• To build the graph from the point process, we need to introduce a process of independent uniform variables. Let

$$U_\Pi \triangleq (U_{\eta_i,\eta_j})_{\eta_i,\eta_j \in \Pi}$$

be such that is an independent process where

• is the (random) edge set of the graphex process at size .

• is the set of vertices of .

• is all pairs of points in that are not connected by an edge.

• The number of edges in the graph is

• The neighbours of in are

$$N_n(\eta) \triangleq \{\eta' : (\eta,\eta') \in P_1(\Gamma_n)\}$$

• For all , the set of paths of length in is

$$P_k(\Gamma_n) \triangleq \big\{(\eta_i)_{i \le k+1} \in V(\Gamma_n)^{k+1} : (\eta_i,\eta_{i+1}) \in \Gamma_n \ \ \forall i \le k\big\}.$$

• The degree of in is .

• Asymptotically, the number of edges of a graphex process scales as $n^2$ Borgs:Chayes:Cohn:Holden:2016. Let $E$ be the proportionality constant

$$E \triangleq \lim_{n\to\infty} \frac{E_n}{n^2}.$$
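The neighbourhood, path, and degree notation above can be made concrete on a small edge list. The sketch below is illustrative only: it assumes edges are undirected, that the degree of a vertex is the size of its neighbour set, and, following the displayed definition of $P_k(\Gamma_n)$, that a path of length $k$ is any sequence of $k+1$ vertices whose consecutive pairs are edges (so vertices may repeat). The helper names are hypothetical.

```python
from collections import defaultdict

def neighbours(edges):
    """Map each vertex to its neighbour set, treating edges as undirected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def degree(adj, vertex):
    """Degree of a vertex, taken here to be the size of its neighbour set."""
    return len(adj[vertex])

def paths(adj, k):
    """All sequences of k + 1 vertices whose consecutive pairs are edges."""
    result = [(v,) for v in sorted(adj)]
    for _ in range(k):
        # Extend every partial path by each neighbour of its last vertex.
        result = [p + (w,) for p in result for w in sorted(adj[p[-1]])]
    return result

adj = neighbours([(0, 1), (1, 2)])
print(degree(adj, 1), paths(adj, 1))
```

With `k = 1` the paths are exactly the ordered edge pairs, which matches the definition of the neighbour set in terms of $P_1(\Gamma_n)$.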

Next, we introduce notation relating to model parameters. Treating the embedding parameters requires some care. The collection of vertices of the graph is a random quantity, and so the embedding parameters must also be modelled as random. For graphex processes, this means the embedding parameters depend on the latent Poisson process used in the generative model. To phrase a theoretical result, it is necessary to assume something about the structure of the dependence. The choice we make here is: the embedding parameters are taken to be markings of the Poisson process $\Pi$. In words, the embedding parameter of a vertex may depend on the (possibly latent) properties of that vertex, but the embeddings are independent of everything else.

• The collection of all possible parameters is:

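$$\Omega_\theta^{\Pi} \triangleq \big\{(\lambda_\eta,\gamma)_{\eta \in \Pi} : \lambda_\eta \in \Omega_\theta \ \ \forall \eta \in \Pi \ \text{ and } \ \gamma \in \Omega_\gamma\big\}.$$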

Note that we attach a copy of the global parameter to each vertex for mathematical convenience.

• For all , let denote the projection on and let denote the projection on

• The following concepts and notations are needed to build a marking of the Poisson process: Let be a distributional kernel on . We generate the marks according to a distribution on , conditional on , such that if then:

• is an independent process

• for all

• Let the augmented object that carries information about both the graph structure () and the model parameters .

## Appendix C Results for p-sampling

We begin by establishing the result for $p$-sampling, with the negative examples chosen according to the induced subgraph. This is the simplest case, and is useful for the introduction of ideas and notation. We consider more general approaches to negative sampling in the next section, where it is treated in tandem with random walk sampling. The same arguments can be used to extend $p$-sampling to allow for, e.g., unigram negative sampling used in our experiments.

For all , let be the graph where the vertices are annotated with their embeddings from . We write the empirical risk as .

###### Theorem C.1.

Let a random variable taking value in such that , for a certain kernel , then there is some constant such that if then

$$\hat{R}(\Gamma_n(\bar\theta)) \to c^{\mathrm{ps}}_{m}$$

both a.s. and in , as .

Moreover there is some constant such that

$$\min_{\theta} \hat{R}(\Gamma_n(\theta)) \to c^{\mathrm{ps}}_{*}$$

both a.s. and in , as .

###### Proof.

We first prove the first statement. Let , let be the edge set of , and let be the partially labeled graph obtained from by forgetting all labels in (but keeping larger labels and the embeddings ). Let be the $\sigma$-field generated by . A key observation is

$$\hat{R}_k(\Gamma_n(\bar\theta)) = \mathbb{E}\big[L(\Gamma_k, \Gamma(\bar\theta)|_k) \mid \mathcal{F}_n(\bar\theta)\big]. \tag{9}$$

The reason is that choosing a graph by $p$-sampling is equivalent to uniformly relabeling the vertices in and restricting to labels less than ; averaging over this random relabeling operation is precisely the expectation on the right-hand side.

By the reverse martingale convergence theorem we get that:

but as is a trivial sigma-algebra we get the desired result.

We now prove the second statement. Let be the partially labeled graph obtained from by forgetting all labels in and let be the $\sigma$-field generated by . Further, we denote the set of embeddings of the graph by:

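$$\Omega_\theta^{\Gamma_m} \triangleq \big\{(\lambda_V,\gamma)_{V \in \Gamma_m} : \forall V \in V(\Gamma_m)\ \ \lambda_V \in \Omega_\lambda,\ \gamma \in \Omega_\gamma\big\}.$$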

We are now ready to state the proof. Let , and observe that:

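$$\begin{aligned}
\mathbb{E}\Big[\min_{\theta \in \Omega_\theta^{\Gamma_n}} \hat{R}_k(\Gamma_n(\theta)) \,\Big|\, \mathcal{F}_m\Big] &\le \min_{\theta \in \Omega_\theta^{\Gamma_m}} \mathbb{E}\big[L(\Gamma_k, \Gamma(\theta)|_k) \mid \mathcal{F}_m\big] && (10)\\
&= \min_{\theta \in \Omega_\theta^{\Gamma_m}} \hat{R}_k(\Gamma_m(\theta)). && (11)
\end{aligned}$$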

Thus, is a supermartingale with respect to the filtration . Moreover, by assumption, the loss is bounded, and thus so is the empirical risk. Supermartingale convergence then establishes that converges almost surely and in to some random variable that is measurable with respect to . The proof is completed by the fact that is trivial.∎
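The key observation (9) used in the proof says that the empirical risk is an average of the loss over random subsamples of the dataset graph. The sketch below approximates such an average by Monte Carlo. It assumes the usual definition of $p$-sampling (each vertex is kept independently with probability $p$, and the induced subgraph is returned), and it uses a hypothetical toy loss, the fraction of dataset edges that survive the subsampling, whose exact expectation is $p^2$ for a graph without self-loops. The function names are illustrative, not part of the paper's software.

```python
import numpy as np

def p_sample(edges, n_vertices, p, rng):
    """Keep each vertex independently with probability p; return induced edges."""
    keep = rng.uniform(size=n_vertices) < p
    both_kept = keep[edges[:, 0]] & keep[edges[:, 1]]
    return edges[both_kept]

def monte_carlo_risk(edges, n_vertices, p, loss, n_draws, rng):
    """Average a subgraph loss over repeated p-samples of the dataset graph."""
    draws = [loss(p_sample(edges, n_vertices, p, rng)) for _ in range(n_draws)]
    return float(np.mean(draws))

# Toy dataset: the complete graph on 6 vertices.
edges = np.array([(i, j) for i in range(6) for j in range(i + 1, 6)])
rng = np.random.default_rng(0)
toy_loss = lambda sub: len(sub) / len(edges)  # bounded in [0, 1]
risk = monte_carlo_risk(edges, 6, 0.5, toy_loss, n_draws=5000, rng=rng)
print(f"Monte Carlo risk {risk:.3f}, exact value {0.5 ** 2:.3f}")
```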

## Appendix D Random-walk sampling

In this section we establish the convergence of the relational empirical risk defined by the random walk. The argument proceeds as follows: We first recast the subsampling algorithm as a random probability measure, measurable with respect to the dataset graph . Producing a graph according to the sampling algorithm is the same as drawing a graph according to the random measure. Establishing that the relational empirical risk converges then amounts to establishing that expectations with respect to this random measure converge; this is the content of Theorem D.8. To establish this result, we show in Lemma D.6 that sampling from the random-walk random measure is asymptotically equivalent to a simpler sampling procedure that depends only on the properties of the graphex process and not on the details of the dataset. We allow for very general negative sampling distributions in this result; we show how to specialize to the important case of (a power of) the unigram distribution in Lemma D.7.

### D.1 Random-walk Notation

We begin with a formal description of the subsampling procedure that defines the relational empirical risk. We will work with random subsets of the Poisson process ; these translate to random subgraphs of in the obvious way. Namely, if the sampler selects in the Poisson process, then it selects in .

Sampling follows a two stage procedure: we choose a random walk, and then augment this random walk with additional vertices—this is the negative-sampling step. The following introduces much of the additional notation we require for this section.

###### Definition D.1 (Random-walk sampler).

Let be a (random) probability measure over . Let be a sequence of vertices sampled according to:

1. (random-walk) and let for .

2. (augmentation) be a sequence of additional vertices sampled from independently from each other and also from .

Let be the vertex induced subgraph of . Let be the random probability distribution over subgraphs induced by this sampling scheme. Finally, let denote the same subgraph augmented with the embeddings.
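The two-stage procedure of Definition D.1 can be sketched as follows. This is a hedged illustration rather than the paper's sampler: it assumes a simple random walk whose start vertex is drawn proportionally to degree, and it takes the augmentation distribution as a user-supplied probability vector `aug_probs` (uniform in the usage line below). The function name and arguments are hypothetical.

```python
import numpy as np

def random_walk_sample(adj, walk_len, n_aug, aug_probs, rng):
    """Random walk plus independent augmentation; returns the induced subgraph.

    adj maps each integer vertex to a list of its neighbours.
    """
    vertices = np.array(sorted(adj))
    degrees = np.array([len(adj[v]) for v in vertices], dtype=float)
    # Stage 1 (random walk): degree-proportional start is an assumption here.
    current = int(rng.choice(vertices, p=degrees / degrees.sum()))
    walk = [current]
    for _ in range(walk_len - 1):
        current = int(rng.choice(adj[current]))  # uniform step to a neighbour
        walk.append(current)
    # Stage 2 (augmentation): extra vertices drawn independently of the walk.
    extra = [int(v) for v in rng.choice(vertices, size=n_aug, p=aug_probs)]
    # Vertex-induced subgraph on the union of walk and augmentation vertices.
    selected = set(walk) | set(extra)
    induced = [(u, v) for u in sorted(selected) for v in adj[u]
               if v in selected and u < v]
    return walk, extra, induced

adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}  # a 4-cycle
rng = np.random.default_rng(0)
walk, extra, induced = random_walk_sample(adj, 3, 2, [0.25] * 4, rng)
print(walk, extra, induced)
```

Because the subgraph is vertex-induced, it contains every edge and non-edge among the selected vertices; which pairs enter the loss is a separate modelling choice.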

With this notation in hand, we rewrite the loss function and the risk in a mathematically convenient form.

###### Definition D.2 (Loss and risk).

The loss on a subsample is

$$L(G_H, G_H(\bar\theta)) \in [0,1].$$

The empirical risk is

$$\mathbb{E}_{P_n}\big[L(G_H(\lambda), G_H(\bar\theta)) \mid \bar\Pi_n(\bar\theta)\big].$$

###### Remark D.3.

Note that the subgraphs produced by the sampling algorithm explicitly include all edges and non-edges of the graph. However, the loss may (and generally will) depend on only a subset of the pairs. In this fashion, we allow for the practically necessary division between negative and positive examples. Skipgram augmentation can be handled with the same strategy.

We impose a technical condition on the distribution that the additional vertices are drawn from. Intuitively, the condition is that the distribution is not too sensitive to details of the dataset in the large data limit.

###### Definition D.4 (Augmentation distribution).

We say is an asymptotically exchangeable augmentation distribution if there is a such that

• There is a deterministic function such that

• where

Lemma D.7 establishes that the unigram distribution respects these conditions.

### D.2 Technical lemmas

We begin with some technical inequalities controlling sums over the latent Poisson process. To interpret the lemma, note that the degree of a vertex with latent property is given by in the lemma statement.

###### Lemma D.5.

Let be such that is distributed as a process of independent uniforms in and let

$$f_n(y,\Pi) \triangleq \sum_{\eta \in \Pi_n} \mathbb{I}\big(U_x(\eta) \le W(y,x)\big),$$

for all . Then the following hold:

1. such that , there are such that ,

$$\mathbb{P}\left(\left|\frac{f_n(y,\Pi)}{n W(y,\cdot)} - 1\right| \ge \beta\right) \le \frac{K}{n^3 \beta^p}.$$

2. such that

$$\mathbb{P}\left(\left|\frac{f_n(y,\Pi)}{n} - W(y,\cdot)\right| \ge \beta\right) \le \frac{K_p}{n^p \beta^{2p}}$$

and

$$\mathbb{P}\left(\left|\frac{E_n}{n^2 E} - 1\right| \ge \beta\right) \le \frac{K_p}{n^p \beta^{2p}}.$$

3. such that such that then
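To build intuition for the concentration of $f_n(y,\Pi)$, the following sketch simulates it directly. The setup is an assumption made for illustration: it uses the graphon $W(x,y) = e^{-(x+y)}$, reads $W(y,\cdot)$ as the marginal $\int W(y,x)\,dx$ (here $e^{-y}$, up to a truncation of the latent coordinate at a hypothetical cutoff `x_max`), and attaches one independent uniform to each atom of the Poisson process. The function name is hypothetical.

```python
import numpy as np

def simulate_f_n(y, n, graphon, x_max, rng):
    """Count atoms of the truncated Poisson process with U <= W(y, x)."""
    # Atoms with labels in [0, n] and latent values in [0, x_max], unit intensity.
    n_points = rng.poisson(n * x_max)
    latents = rng.uniform(0.0, x_max, size=n_points)
    uniforms = rng.uniform(size=n_points)  # one independent uniform per atom
    return int(np.sum(uniforms <= graphon(y, latents)))

graphon = lambda x, y: np.exp(-(x + y))  # illustrative choice of W
rng = np.random.default_rng(0)
y, n, x_max = 0.5, 20_000, 20.0
marginal = np.exp(-y) * (1.0 - np.exp(-x_max))  # truncated integral of W(y, x)
ratio = simulate_f_n(y, n, graphon, x_max, rng) / (n * marginal)
print(f"f_n / (n W(y, .)) = {ratio:.4f}")
```

Under these assumptions the count is Poisson with mean $n\,W(y,\cdot)$, so the ratio settles near one as $n$ grows, which is the behaviour the first two inequalities quantify.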

###### Proof.

We will first write the proof of the first statement, which is harder. We then highlight the differences in the other cases. We use the Stein exchangeable pair method, recalled in Section B.2.

Let be such that

$$\forall x,y \quad F(x,y) = [x-y].$$

Let and let

$$\Pi' = T_{[\bar J,\bar J+1],[n,n+1]} \cdot \Pi_\nu \times \Pi_x,$$

where is the permutation of and and

$$T_{[\bar J,\bar J+1],[n,n+1]} \cdot \Pi_\nu \times \Pi_x \triangleq \big\{\big(T_{[\bar J,\bar J+1],[n,n+1]}(\nu), x\big),\ \forall (\nu,x) \in \Pi\big\}$$

Then we can check the following:

• As we obtain that

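$$\begin{aligned}
\mathbb{E}\left(\frac{f_n(y,\Pi)}{W(y,\cdot)} - \frac{f_n(y,\Pi')}{W(y,\cdot)} \,\middle|\, \Pi_n\right)
&\overset{(a)}{=} \frac{1}{n W(y,\cdot)} \left[\sum_{j=0}^{n-1} \sum_{\Pi_{j+1} \setminus \Pi_j} \mathbb{I}\big(U_x(\eta) \le W(y,x)\big) - \mathbb{E}\Big(\mathbb{I}\big(U_x(\eta) \le W(y,x)\big)\Big)\right] \\
&\overset{(b)}{=} \frac{f_n(y,\Pi)}{n W(y,\cdot)} - 1
\end{aligned}$$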

where (a) is obtained by complete independence of and where to get (b) we use the fact that (see Veitch:Roy:2015)

$$\sum_{(\nu,x) \in \Pi_{j+1} \setminus \Pi_j} \mathbb{I}\big(U_x(\eta) \le W(y,x)\big) \sim \mathrm{Poi}\big(W(y,\cdot)\big)$$

• Moreover, we can very similarly see that:

where is a constant that does not depend on or .

Therefore using the exchangeable pair method presented earlier and setting for all such that we get that there is , such that for all

 P(|∑(ν,x)∈ΠnI(