Choosing to grow a graph:
Modeling network formation as discrete choice
Abstract.
We provide a framework for modeling social network formation through conditional multinomial logit models from discrete choice and random utility theory, in which each new edge is viewed as a “choice” made by a node to connect to another node, based on (generic) features of the other nodes available to make a connection. This perspective on network formation unifies existing models such as preferential attachment, triadic closure, and node fitness, which are all special cases, and thereby provides a flexible means for conceptualizing, estimating, and comparing models. The lens of discrete choice theory also provides several new tools for analyzing social network formation; for example, mixtures of existing models can be estimated by adapting known expectation-maximization algorithms, and the significance of node features can be evaluated in a statistically rigorous manner. We demonstrate the flexibility of our framework through examples that analyze a number of synthetic and real-world datasets. For example, we provide rigorous methods for estimating preferential attachment models and show how to separate the effects of preferential attachment and triadic closure. Nonparametric estimates of the importance of degree show a highly linear trend, and we expose the importance of looking carefully at nodes with degree zero. Examining the formation of a large citation graph, we find evidence for an increased role of degree when accounting for age.
1. Introduction
Understanding how networks form and evolve is an essential component of understanding their structure, which in turn underlies the broad range of processes that occur on networks. Models of social network formation can largely be decomposed into node formation and edge formation. In this work, we argue that edge formation can be effectively modeled as a choice made by an actor (or actors) in the network to instantiate a connection to another node.
The diverse research on network formation has led to many models and mechanisms of edge formation, including preferential attachment (Albert and Barabási, 1999), uniform attachment (Callaway et al., 2001), triadic closure (Jin et al., 2001), random walks (Vázquez, 2003; Saramäki and Kaski, 2004), homophily (Papadopoulos et al., 2012), copying edges from existing nodes (Kumar et al., 1999, 2000), latent space structures (Hoff et al., 2002; Leskovec et al., 2007; Papadopoulos et al., 2012), inherent node fitness (Bianconi and Barabási, 2001; Caldarelli et al., 2002), and combinations of all of these (Liu et al., 2002; Jackson and Rogers, 2007; Leskovec et al., 2008). Here, we frame edge formation as a discrete choice process and derive a family of discrete choice models (McFadden et al., 1973; Train, 2009) that subsume a wide range of existing models in a unified framework and also naturally open up a host of powerful extensions.
Discrete choice models are commonly employed in economics, social psychology, and statistics as a way to model how individuals make choices from a slate of discrete alternatives (Agresti, 2003). Typically, the alternatives have associated features, and statistical models of discrete choice make it possible to estimate the relative importance of such features. Such models have been used to answer questions such as how consumers choose goods (Simonson and Tversky, 1992), how people choose where they live (McFadden, 1978), how students choose what college to attend (Fuller et al., 1982), and how commuters choose between different modes of transportation (Train and McFadden, 1978). Discrete choice analysis is also used to understand how choices vary depending on the context in which they are framed: in online commerce, this could be how web layouts lead to different purchasing priorities (Ieong et al., 2012); for choosing colleges, this could be incorporating the effect of the national economy. In this paper, we demonstrate how discrete choice models can similarly help us understand the factors driving social network evolution.
The starting point for the present work is the observation that edge formation events in social networks are naturally viewed as discrete choices. For simplicity, consider a directed graph where edges are formed one by one, and think of the formation of a directed edge $(i, j)$ as $i$ “choosing” to connect with $j$, where the set of alternatives available to $i$ is the set of all other nodes. (While undirected graph models are common in social network analysis, the underlying formation procedure is almost always asymmetric. For example, the Facebook friendship graph is typically modeled as an undirected graph (Ugander et al., 2011), but the friendships are proposed by one of the two nodes in an edge.) The key modeling question is easy to state: why did $i$ choose $j$? This question has long been the informal subject of network formation modeling and is at the same time the exact question that discrete choice models and analysis have been designed to answer. However, up to this point, network formation models have largely been decoupled from discrete choice theory.
In employing discrete choice analysis, we focus on the conditional multinomial logit model, commonly called the conditional logit model for short, which is a foundational workhorse of discrete choice modeling. The model belongs to the family of random utility models, where choices are interpretable as those of a rational actor selecting the alternative with the largest “utility” sampled from random variables that decompose into the inherent utility of the alternative and a noise term. With the conditional logit model, we can use existing optimization routines to estimate model parameters and existing statistical methods to assess the uncertainty of the estimates. Discrete choice models can also easily restrict the set of available alternatives, which is useful when it is not reasonable to assume that the entire set of nodes is available for friendship. For example, sometimes only “friends of friends” are considered (Jackson and Rogers, 2007; Leskovec et al., 2008).
In this paper, we first show that many popular network formation mechanisms can be rewritten as conditional logit models, including preferential attachment, uniform attachment, node fitness, latent space models, and models of homophily. However, the real power of discrete choice models for social network analysis is the ability to combine different features (e.g., node degree and node age), as well as different mechanisms (e.g., triadic closure and preferential attachment) and estimate their relative roles. Social networks are enormously varied in their structure (Ikehara and Clauset, 2017), but existing methods often do a poor job at modeling this diversity. Thus, beyond unifying the network formation and discrete choice literature, we also develop several new tools for social network analysis. For example, we show how to estimate models to distinguish the effects of preferential attachment and triadic closure. We demonstrate these tools by analyzing the formation of the Flickr social network and the formation of a citation network. We find on Flickr that accounting for triadic closure greatly reduces the estimated role of degree in choosing who to connect to, and that nodes with degree zero have a remarkably high utility. On the citation network, we discern the relative roles of the number of previous citations compared to the age of a paper. Our estimates of preferential attachment in the citation network are similar to those observed in prior studies. When accounting for the age of a paper, we find evidence for linear preferential attachment. However, for a fixed degree, we find that age is negatively correlated with the likelihood of a new citation (i.e., older papers are less likely to be cited).
The key assumption underlying our framework is that the available data actually captures edge formation events (either through edge timestamps or other sequential information). In contrast, many existing approaches to understanding network formation focus on observing only the structural properties of a network at a single point of observation, e.g., its degree distribution, and initiating a deductive process to try to understand how variations in edge formation would lead to different outcomes (Albert and Barabási, 1999; Bianconi and Barabási, 2001; Liu et al., 2002; Jackson and Rogers, 2007). This approach leads to tidy analyses and easy-to-characterize asymptotic properties, but model selection in this context is strongly dependent on what properties are compared. Different underlying formation processes can lead to graphs with indistinguishable properties. For example, many different formation processes result in the same heavy-tailed degree distributions (Mitzenmacher, 2003). Thus, when “fitting” outcome measurements in this way, one has to know (or posit), e.g., the relative rates of node formation and edge formation. However, when temporal or sequential data is available (Holme and Saramäki, 2012; Paranjape et al., 2017), our framework overcomes these limitations by incorporating this structure.
2. Discrete choice and edge formation
We now develop network formation through the lens of discrete choice. Throughout this paper, we assume that the networks are directed. Again, while undirected graphs are common in social network analysis, the actual edge formation process often has directed initiation. In the common setting of “growing graphs,” nodes arrive one at a time and form edges when arriving in a network. In these cases, the newly arriving node is considered to be the node initiating the connection; such analysis is standard with, e.g., classical preferential attachment models (Albert and Barabási, 1999).
When modeling the directed formation of an edge $(i, j)$, two processes need to be distinguished, roughly corresponding to the questions “who is $i$?” (the chooser) and “who is $j$?” (the chosen). In this paper, we focus on understanding the latter, i.e., the formation of $(i, j)$ as the selection of $j$ conditional on knowing that $i$ is ready to form an edge. Thus, our discrete choice models of edge formation can be readily estimated from data that implicitly or explicitly contains a record of initiating nodes and used for subsequent analysis, as we show in Sections 3 and 4. Beyond the scope of this work, our model of “$j$ conditional on $i$” can be paired with a model of “initiations by $i$” for a full generative model of network formation.
2.1. Background on discrete choice models
We now review discrete choice models generically, which we will then translate to the context of edge formation in Section 2.2. Consider a universe of alternatives $\mathcal{X}$ and a dataset consisting of $n$ different choices, indexed by $i$. Each choice consists of a choice set $C_i \subseteq \mathcal{X}$ and a chosen item $j_i \in C_i$. The elements in $C_i$ are mutually exclusive choice alternatives, and exactly one element from $C_i$ is chosen. We consider each element $j \in \mathcal{X}$ to be represented by a vector of features $x_j$ (for example, in our analysis in Section 2.2, $x_j$ will be the feature vector of a node in the graph, and $C_i$ will represent a set of nodes). We let $\mathcal{D} = \{(j_i, C_i)\}_{i=1}^{n}$ denote the choice data.
A broad family of discrete choice models is the family of random utility models (RUMs), of which the conditional logit is an important special case. Each alternative $j$ has some inherent utility $u_{i,j}$ to the agent $i$ making the choice; with the conditional logit, we model this utility as a linear function of $j$’s features $x_j$:
$$u_{i,j} = \theta^\top x_j$$
for some (latent) parameter vector $\theta$ that is fixed across individuals $i$. Random utility models assume that agents make “rational” choices by maximizing random utilities centered on these inherent utilities. More formally, the utility observed by the actor is given by $U_{i,j} = u_{i,j} + \varepsilon_{i,j}$, where $\varepsilon_{i,j}$ is a noise term. The probability of choosing option $j$ from the choice set $C$ is
$$P_i(j, C) = \Pr\left(U_{i,j} > U_{i,k} \;\; \forall k \in C \setminus \{j\}\right).$$
When the $\varepsilon_{i,j}$ are i.i.d. standard Gumbel (formally, a standard Gumbel distribution is a generalized extreme value distribution with a cumulative distribution function of the form $F(x) = e^{-e^{-x}}$), the probability of choosing each alternative is proportional to the exponentiated inherent utility (Agresti, 2003):
$$P_i(j, C) = \frac{\exp(\theta^\top x_j)}{\sum_{k \in C} \exp(\theta^\top x_k)}. \tag{1}$$
The above model is the conditional multinomial logit model of discrete choice, though it is often referred to simply as the conditional logit or multinomial logit (MNL) model. If the noise terms are distributed i.i.d. Normal, then the model is the independent multinomial probit. In probit models, the choice probabilities are no longer proportional to exponentiated utilities as in Equation (1). The Gumbel assumption is common in discrete choice theory and will facilitate our connections to a variety of network formation mechanisms. There are more complex random utility models that can impose dependence between noise terms (Train, 2009), and there is also a growing literature on flexible choice models (Tversky, 1972; Benson et al., 2016; Ragain and Ugander, 2016) designed to model context effects and other varied violations of the independence of irrelevant alternatives (Luce, 1959) (a storied axiom satisfied by the conditional multinomial logit model). We leave the relationships of these models to network formation as avenues for future research.
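As a rough illustration of Equation (1), the following Python sketch computes conditional logit choice probabilities and compares them with simulated choices that maximize utility plus standard Gumbel noise. The function names (`choice_probs`, `gumbel_max_choice`) and the toy parameter and feature values are illustrative assumptions, not part of the paper's released code.

```python
import math
import random


def utility(theta, x):
    """Inherent utility: linear in the alternative's features."""
    return sum(t * v for t, v in zip(theta, x))


def choice_probs(theta, features):
    """Conditional logit probabilities over one choice set (Equation (1))."""
    utils = [utility(theta, x) for x in features]
    top = max(utils)  # subtracting the max avoids overflow in exp
    weights = [math.exp(u - top) for u in utils]
    total = sum(weights)
    return [w / total for w in weights]


def gumbel_max_choice(theta, features, rng):
    """Random-utility choice: argmax of utility plus i.i.d. standard Gumbel noise."""
    best_idx, best_val = 0, -math.inf
    for idx, x in enumerate(features):
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        noisy = utility(theta, x) - math.log(-math.log(u))
        if noisy > best_val:
            best_idx, best_val = idx, noisy
    return best_idx


# Toy example (hypothetical numbers): three alternatives, two features each.
theta = [1.0, -0.5]
features = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
probs = choice_probs(theta, features)

# Empirical choice frequencies under the random utility formulation.
rng = random.Random(0)
n_draws = 20000
counts = [0] * len(features)
for _ in range(n_draws):
    counts[gumbel_max_choice(theta, features, rng)] += 1
freqs = [c / n_draws for c in counts]
```

With enough draws, `freqs` should be close to `probs`, which is the sense in which the logit form follows from Gumbel-distributed noise.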
2.2. Edge formation as discrete choice
With the above formalisms in place, we now develop network formation from a discrete choice perspective. We begin by showing how several well-known models can be conveniently expressed as conditional logit models, with a summary given in Table 1. All models are designed to grow simple graphs (i.e., without multi-edges), and the choice set excludes any nodes to which the chooser is already connected. Every item $j$ is represented by its features $x_j$ that, importantly, can evolve over time. The features $x_{j,t}$ of node $j$ at time $t$ are thus always time-indexed, but we often suppress the $t$ to reduce notational clutter.
Preferential attachment. We start with the generalized Barabási-Albert model (Albert and Barabási, 1999; Krapivsky et al., 2000; Bollobás and Riordan, 2003), also known as the generalized Price model (Price, 1976), one of the most studied models in the network formation literature. It is typically stated as a growth model of a time-evolving graph $G_t$, $t = 1, 2, \ldots$, and when a new node arrives it connects to $m$ distinct existing nodes with a probability proportional to a power of their degree at time $t$,
$$P_i(j, C) = \frac{d_{j,t}^{\alpha}}{\sum_{k \in C} d_{k,t}^{\alpha}}. \tag{2}$$
The exponent parameter $\alpha$ controls the relative importance of degree (Krapivsky et al., 2000). The case where $\alpha = 1$ is called linear preferential attachment, and produces networks that can mimic a range of structural properties observed in empirical networks. If we represent each potential neighbor $j$ with the time-indexed one-dimensional “feature vector” $x_{j,t} = \log d_{j,t}$ and employ a conditional logit model as in Equation (1), we obtain a utility of $j$ for $i$ at time $t$ of $u_{i,j,t} = \theta \log d_{j,t}$. Here the choice model parameter $\theta$ plays the exact role of $\alpha$, since $\exp(\theta \log d_{j,t}) = d_{j,t}^{\theta}$.
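Given a growing network $G_t$, we can construct a choice dataset $\mathcal{D}$ from this network by extracting the chosen node $j$, node sets $C$, and degree sequence at each time step. The preferential attachment model has only one parameter, $\alpha$. The log-likelihood for that parameter given a dataset $\mathcal{D}$ is then:
$$\ell(\alpha; \mathcal{D}) = \sum_{(j, C) \in \mathcal{D}} \left[ \alpha \log d_j - \log \sum_{k \in C} \exp(\alpha \log d_k) \right].$$
We’ve suppressed the time index from the features to reduce clutter, but emphasize that $d_j$ is the degree at the time of the choice.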
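The log-likelihood of the log-degree logit can be evaluated directly from choice data. The sketch below is a minimal illustration, assuming each choice is stored as a pair `(chosen_index, degrees)` with strictly positive degrees; this data layout, the name `pa_log_likelihood`, and the toy numbers are assumptions made for the example.

```python
import math


def pa_log_likelihood(alpha, choices):
    """Log-likelihood of the exponent alpha under the log-degree conditional logit.

    choices: list of (chosen_index, degrees), where degrees lists the (positive)
    degrees of all nodes in the choice set at the time of the choice.
    """
    total = 0.0
    for chosen, degrees in choices:
        scores = [alpha * math.log(d) for d in degrees]
        top = max(scores)  # stabilized log-sum-exp for the normalizer
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[chosen] - log_norm
    return total


# Hypothetical toy data: two choices over small choice sets.
choices = [(1, [1, 3]), (0, [2, 1, 1])]
ll_linear = pa_log_likelihood(1.0, choices)   # linear preferential attachment
ll_uniform = pa_log_likelihood(0.0, choices)  # alpha = 0 reduces to uniform attachment
```

For these toy choices, the linear model assigns probabilities 3/4 and 2/4 to the observed picks, so `ll_linear` equals `log(3/8)`, while `ll_uniform` equals `log(1/6)`.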
Nonparametric preferential attachment. The above model assumes an attachment kernel of a particular parametric form. From a discrete choice perspective, one can also estimate the role of degree in edge formation nonparametrically by estimating a coefficient $\theta_d$ for each degree $d$ individually. This approach has the added benefit of being able to assign positive probability to choosing nodes with degree zero. Under this model, the log-likelihood of the parameters $\theta$ given the dataset $\mathcal{D}$ is:
$$\ell(\theta; \mathcal{D}) = \sum_{(j, C) \in \mathcal{D}} \left[ \theta_{d_j} - \log \sum_{k \in C} \exp(\theta_{d_k}) \right].$$
Again we’ve suppressed time indexing to simplify the presentation. Pham et al. (2015) previously described a version of the above likelihood as a means of measuring the attachment kernel using maximum likelihood, albeit without making the connection to discrete choice.
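Table 1. Network formation processes expressed as conditional logit models, with their utility functions and choice sets.
Process  Utility $u_{i,j}$  Choice set $C$
Uniform attachment (Callaway et al., 2001)  1  all nodes
Preferential attachment (Albert and Barabási, 1999; Krapivsky et al., 2000)  $\alpha \log d_j$  all nodes
Nonparametric PA (Newman, 2001; Redner, 2005; Pham et al., 2015)  $\theta_{d_j}$  all nodes
Triadic closure (Rapoport, 1953)  1  friends-of-friends of $i$
FoF attachment (Jin et al., 2001; Vázquez, 2003; Saramäki and Kaski, 2004)  $\alpha \log \eta_{i,j}$  friends-of-friends of $i$
PA, FoFs only  $\alpha \log d_j$  friends-of-friends of $i$
Individual node fitness (Caldarelli et al., 2002)  $\theta_j$  all nodes
Latent space (Hoff et al., 2002; Leskovec et al., 2007; Papadopoulos et al., 2012)  $\theta\,\delta(i, j)$  all nodes
Stochastic block model (Karrer and Newman, 2011)  
Homophily (McPherson et al., 2001)  $\theta \cdot \mathbb{1}\{i, j \text{ in the same class}\}$  all nodes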
Uniform attachment. A simple edge formation model is to sample a new neighbor uniformly at random from all nodes (Callaway et al., 2001). There are no parameters in this model, but we can still write down the log-likelihood of the model given a dataset, which will be useful when we later combine this model with others within a mixture model:
$$\ell(\mathcal{D}) = \sum_{(j, C) \in \mathcal{D}} \log \frac{1}{|C|}.$$
Triadic closure. A variant of uniform attachment is for $i$ to attach to new neighbors uniformly at random from the set of their friends-of-friends, as opposed to the set of all nodes. This process effectively models triadic closure (Rapoport, 1953). It has the same simple functional form as the uniform model, but now the choice set varies with each choice, namely, the choice set is restricted to be only the friends of friends of node $i$ (the chooser) to which $i$ is not already connected. This change in choice set can also be achieved by assuming the utility of $j$ to $i$ at time $t$ is $u_{i,j,t} = \log \mathrm{FoF}_{i,j,t}$, where $\mathrm{FoF}_{i,j,t}$ is a boolean indicating whether $i$ and $j$ are friends of friends at time $t$ (so that nodes that are not friends of friends receive utility $-\infty$ and hence probability zero), and then letting the choice set revert to the full node set.
An additional model that naturally combines the ideas of preferential attachment and befriending friends-of-friends takes the number of friends in common between $i$ and $j$ as a feature. We could define this feature as $\eta_{i,j,t} = \sum_{k} A_{i,k,t} A_{k,j,t}$, where $A_{i,j,t}$ indicates whether there is an edge between $i$ and $j$ at time $t$. The corresponding utility would be $u_{i,j,t} = \alpha \log \eta_{i,j,t}$. This model is similar (but not equivalent) to random walk-based formation models (Jin et al., 2001; Vázquez, 2003; Saramäki and Kaski, 2004), which emphasize formation within a local neighborhood.
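Both the friends-of-friends choice set and the common-friends feature can be read off an adjacency structure. The following is a small sketch, assuming for simplicity an undirected graph stored as a dictionary of neighbor sets; the helper names and the toy graph are hypothetical, and a directed variant would need to decide which edge directions count as friendship.

```python
def friends_of_friends(adj, i):
    """Nodes two hops from i that are not i itself and not already neighbors of i."""
    fofs = set()
    for k in adj[i]:
        fofs |= adj[k]
    return fofs - adj[i] - {i}


def common_friends(adj, i, j):
    """Number of nodes adjacent to both i and j."""
    return len(adj[i] & adj[j])


# Hypothetical undirected toy graph stored as neighbor sets.
adj = {
    0: {1, 2},
    1: {0, 3},
    2: {0, 3},
    3: {1, 2, 4},
    4: {3},
}

choice_set = friends_of_friends(adj, 0)                           # restricted choice set for node 0
eta = {j: common_friends(adj, 0, j) for j in choice_set}          # common-friend counts as features
```

Every node in the friends-of-friends set has at least one common friend with the chooser, so the logarithm of the common-friend count is well defined on that choice set.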
Node fitness. Another line of formation models subsumed by the discrete choice framework are those involving fitness. In these models, nodes choose to connect to others based on some intrinsic latent fitness score. Certain distributions of fitness values lead to a scale-free degree distribution (Caldarelli et al., 2002), providing an alternative explanation to preferential attachment for modeling such degree distributions. We can express the node fitness model by a conditional logit model with a separate fixed effect $\theta_j$ for each node $j$ (so the feature of a node is an indicator vector of its identity). The log-likelihood of the fitness parameters given the data is then:
$$\ell(\theta; \mathcal{D}) = \sum_{(j, C) \in \mathcal{D}} \left[ \theta_j - \log \sum_{k \in C} \exp(\theta_k) \right].$$
This formation model is equivalent to the classic Bradley-Terry-Luce model of discrete choice for estimating the quality of alternatives (Luce, 1959). Alternatively, one could replace the individual fixed effects with surrogate features of node fitness such as an auxiliary measure of gregariousness (in the case of social networks), or the impact factor of a paper’s journal (in the case of citation networks).
Latent space models. Another class of network formation models postulates the existence of a latent space that drives connections between nodes. Examples of latent spaces include Euclidean space (Hoff et al., 2002), hyperbolic space (Krioukov et al., 2010), a tree (Leskovec et al., 2007), a circle (Papadopoulos et al., 2012), or a set of discrete classes (Holland et al., 1983). While the conditional logit model in the form that we describe it does not facilitate finding the best-fitting latent space assignment to explain the data, it can be used to estimate the relative importance of a known latent space given a distance function $\delta(i, j)$. As one example from the family of latent space models, in the community-guided attachment (CGA) model (Leskovec et al., 2007) all pairs of nodes $i$ and $j$ have a distance $\delta(i, j)$ derived from the height of their common parents in a latent tree structure situating all nodes. Given this tree as known, a node connects to another proportionally to $c^{-\delta(i,j)}$ for some scalar $c$. As a conditional logit model, the corresponding utility function is $u_{i,j} = \theta\,\delta(i, j)$. The parameter $c$ can be retrieved by fitting a conditional logit with the known $\delta(i, j)$ as the only variable and transforming the estimated parameter with $c = \exp(-\theta)$. Assuming that the latent space representation is given is a strong assumption, and fitting such a model while estimating the latent space representation (e.g., as done by Hoff et al. (2002) in Euclidean space) is much more difficult.
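The link between the logit coefficient and the CGA scalar follows from rewriting the attachment weight as an exponential. Writing the tree distance as $\delta(i, j)$, the CGA scalar as $c$, and the logit coefficient as $\theta$ (notation assumed here for illustration), a short derivation is:

```latex
c^{-\delta(i,j)}
  = \exp\bigl(-\delta(i,j)\,\log c\bigr)
  = \exp\bigl(\theta\,\delta(i,j)\bigr)
\quad\text{with}\quad \theta = -\log c
\quad\Longleftrightarrow\quad c = e^{-\theta}.
```

Under this reading, a fitted negative coefficient $\theta$ corresponds to $c > 1$, i.e., attachment that becomes less likely as the tree distance grows.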
Additional models. Conditional logit models are very flexible and can deal with multiple features and interactions between them. Any number of features can be added, including node covariates and structural features like a node’s clustering coefficient (Bagrow and Brockmann, 2013) or age (Callaway et al., 2001; Leskovec et al., 2008). Conditional multinomial logit models can also be used to investigate the role of homophily (McPherson et al., 2001) in edge formation, by adding a binary feature indicating whether nodes $i$ and $j$ are part of the same class.
Table 1 summarizes how several network formation models fit within the discrete choice framework via their corresponding utility functions and choice sets. A major advantage of this framework is that different features can easily be combined into a single model and jointly estimated. Or, when suitable, one can employ a mixture of conditional logit models, as we show in the next section.
2.3. Combining modes using Mixed Logit
So far we have written a range of existing and new edge formation models as conditional logit models, a specific type of discrete choice model. Several existing edge formation models do not fit neatly into the conditional logit framework but align exactly with the use of mixture models in discrete choice modeling. Following our success formulating edge formation models as conditional logit models, in this subsection we develop mixed conditional logit formulations of several additional models.
A common proposal to make network formation models more flexible is to augment an existing model by allowing nodes to pick neighbors uniformly at random with some probability $q$, while running the ordinary model with probability $1 - q$ (Kumar et al., 1999, 2000; Liu et al., 2002; Cooper and Frieze, 2003). This augmentation increases flexibility because it enables the model to explain edge events that may otherwise have probability zero. Within discrete choice, this approach is precisely a mixed logit model where one of the mixture modes is uniform attachment.
While the conditional logit estimates a single parameter vector $\theta$ representing average preferences as shared by all agents, the mixed logit model is often used to account for differences in preferences across various types of agents. In its most general form, the mixed logit is expressed using a probability distribution $f(\theta)$ over different instantiations of the parameter vector $\theta$:
$$P_i(j, C) = \int \frac{\exp(\theta^\top x_j)}{\sum_{k \in C} \exp(\theta^\top x_k)}\, f(\theta)\, d\theta.$$
In this work, we will only consider discrete mixtures of logits, also called a latent class model (Kamakura and Russell, 1989):
$$P_i(j, C) = \sum_{m=1}^{M} \pi_m \frac{\exp(\theta_m^\top x_j)}{\sum_{k \in C} \exp(\theta_m^\top x_k)},$$
where $\sum_{m=1}^{M} \pi_m = 1$ and the weights $\pi_m$ model the relative prevalence of each mode.
Copy model. The copy model is a classic formation process that can be written as a mixed logit with two modes. In the first mode, new edges connect proportional to degree with probability $p$, while in the second mode they connect uniformly at random with probability $1 - p$ (Liu et al., 2002; Cooper and Frieze, 2003). As a mixed logit model, the utilities of the two modes are $u_{i,j} = \log d_j$ and $u_{i,j} = 1$, respectively, and the class probabilities are $(\pi_1, \pi_2) = (p, 1 - p)$. (This is a special case of the original copy model where edges are copied from a sampled vertex (Kumar et al., 2000); the model here is when , which is often used for analysis (Easley and Kleinberg, 2010).) The connection between relaxations of preferential attachment and mixture models was also recently observed by Medina et al. (2018).
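The two-mode copy model can be evaluated as a latent class mixture of two conditional logits. The sketch below is an illustration only; the function names and the toy degrees are assumptions, and the uniform mode is written as a logit with constant utility so that both modes share one code path.

```python
import math


def logit_probs(theta, xs):
    """Conditional logit probabilities for scalar features xs and a scalar theta."""
    scores = [theta * x for x in xs]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def copy_model_probs(p, degrees):
    """Two-mode latent class model: degree-proportional with probability p,
    uniform with probability 1 - p."""
    pa = logit_probs(1.0, [math.log(d) for d in degrees])  # utility log d_j
    uniform = logit_probs(0.0, [0.0] * len(degrees))        # constant utility
    return [p * a + (1.0 - p) * u for a, u in zip(pa, uniform)]


# Hypothetical choice set with degrees 1, 3, and 4.
probs = copy_model_probs(0.5, [1, 3, 4])
```

Here the degree mode gives probabilities (1/8, 3/8, 4/8) and the uniform mode gives 1/3 each, so the mixture places probability 0.5 · 4/8 + 0.5 · 1/3 ≈ 0.417 on the highest-degree node.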
Local search model. Another example of a model with multiple modes is the Jackson-Rogers model of edge formation as a mixture of uniform attachment and triadic closure (Jackson and Rogers, 2007). The original model is based on a relative rate between edges forming at random and edges formed locally. It also has edges form based on respective acceptance probabilities. We describe a simplified version of this model, which we’ll call the local search model, where edges connect to nodes selected uniformly at random from the full node set with probability $p$ and uniformly at random from the set of friends-of-friends with probability $1 - p$. (Since the parameter $r$ in the original presentation is actually the rate of uniform attachment, we can relate it to our $p$ through $p = r/(1 + r)$. For example, if the rate between random and friend-of-friend edges is one to one ($r = 1$), then $p = 0.5$.) We can represent this simplified process with a two-mode mixed logit model. In this case the mixture parameters are $(\pi_1, \pi_2) = (p, 1 - p)$ and both modes have the same utility function $u_{i,j} = 1$, but their choice sets differ so that the second mode only considers friends-of-friends. (A model with a restricted choice set, for example to only friends-of-friends, gives a likelihood of zero to choices outside the choice set.)
Table 2 overviews the mixture model formulations described above, as well as a new model, the $(r, p)$-model, that we use in Section 4.2 to analyze preferential attachment effects.
Process  Modes 
Copy model (Kumar et al., 1999)  Uniform, PA 
Node types (Kumar et al., 2010)  New node, PA, none 
Local search (Jackson and Rogers, 2007)  Uniform, Uniform FoF 
$(r, p)$-model  Uniform, PA, Uniform FoF, PA FoF 
3. Estimation and Inference
To learn a discrete choice model of network formation from data, we assume that we have access to a sequence of directed edges, in chronological order. This sequence of edges needs to be recast as choice data in order to fit a choice model. For every formed edge $(i, j)$, we create a data point consisting of the choice $j$, the choice set $C$ of candidate nodes at the time, and the features $x_k$ of each candidate node $k \in C$ at the time.
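Recasting an ordered edge list as choice data can be sketched as follows. This is a simplified illustration, not the paper's released preprocessing code: it assumes a node becomes available as a candidate the first time it appears in the edge list, uses in-degree as the only feature, and skips self-loops and repeated edges. The function name `edges_to_choices` is hypothetical.

```python
def edges_to_choices(edges):
    """Convert a chronological list of directed edges (i, j) into choice data.

    Returns a list of (chooser, chosen, candidates), where candidates maps each
    available node to its in-degree just before the edge was formed.
    """
    in_degree = {}  # node -> current in-degree
    out_nbrs = {}   # node -> set of nodes it already points to
    choices = []
    for i, j in edges:
        for node in (i, j):  # register nodes on first appearance (assumption)
            in_degree.setdefault(node, 0)
            out_nbrs.setdefault(node, set())
        if i == j or j in out_nbrs[i]:
            continue  # skip self-loops and repeated edges (simple graph)
        # Choice set: every known node except the chooser and its current neighbors.
        candidates = {k: d for k, d in in_degree.items()
                      if k != i and k not in out_nbrs[i]}
        choices.append((i, j, candidates))
        out_nbrs[i].add(j)   # update the graph only after recording the choice
        in_degree[j] += 1
    return choices


# Hypothetical toy edge list in chronological order.
edges = [(1, 0), (2, 0), (2, 1), (3, 0)]
choices = edges_to_choices(edges)
```

Features are recorded before the graph is updated, so each data point reflects the state of the network at the time of the choice. Nodes with in-degree zero appear in the candidate sets; a log-degree model would need to handle them separately, for example with a nonparametric degree coefficient.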
Given a data set and a conditional logit model, one can write out the log-likelihood, as shown in Section 2.2. For any conditional logit model with a linear utility $u_{i,j} = \theta^\top x_j$, the log-likelihood is concave in the variables $\theta$ (equivalently, the negative log-likelihood is convex) and can be efficiently maximized using standard gradient-based optimization (e.g., BFGS). The functional form of the logit leads to straightforward gradients. For example, for preferential attachment, the gradient is
$$\frac{\partial \ell(\alpha; \mathcal{D})}{\partial \alpha} = \sum_{(j, C) \in \mathcal{D}} \left[ \log d_j - \frac{\sum_{k \in C} d_k^{\alpha} \log d_k}{\sum_{k \in C} d_k^{\alpha}} \right],$$
where the time dependence of the features (degrees) has been suppressed to reduce clutter. Gradients for the other choice models in Section 2.2 are omitted but straightforward.
One advantage of likelihood-based model fitting is that we can compute standard errors and confidence intervals of the parameters. In particular, the standard errors can be computed with $\sqrt{\operatorname{diag}\left((-H)^{-1}\right)}$ (Train, 2009), where $H$ is the Hessian matrix of second derivatives of the log-likelihood at the parameters $\hat{\theta}$.
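For the one-parameter log-degree logit, the gradient and Hessian are simple enough to fit the exponent with Newton's method and read off a standard error. The sketch below is a minimal illustration on synthetic data; it swaps in Newton's method for the BFGS routine mentioned above, and the function names, data layout `(chosen_index, degrees)`, and simulation settings are all assumptions.

```python
import math
import random


def grad_and_hessian(alpha, choices):
    """First and second derivatives of the log-degree logit log-likelihood."""
    grad, hess = 0.0, 0.0
    for chosen, degrees in choices:
        logd = [math.log(d) for d in degrees]
        scores = [alpha * v for v in logd]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        probs = [w / total for w in weights]
        mean = sum(p * v for p, v in zip(probs, logd))
        var = sum(p * (v - mean) ** 2 for p, v in zip(probs, logd))
        grad += logd[chosen] - mean  # observed minus expected log-degree
        hess -= var                  # negative variance: the log-likelihood is concave
    return grad, hess


def fit_alpha(choices, alpha=0.0, max_iter=50, tol=1e-10):
    """Newton's method for the maximum likelihood exponent and its standard error."""
    for _ in range(max_iter):
        grad, hess = grad_and_hessian(alpha, choices)
        step = grad / hess
        alpha -= step
        if abs(step) < tol:
            break
    _, hess = grad_and_hessian(alpha, choices)
    std_err = math.sqrt(-1.0 / hess)  # square root of the inverse negative Hessian
    return alpha, std_err


# Synthetic choices (hypothetical setup): 2000 choice sets of 10 nodes,
# chosen with probability proportional to degree (true alpha = 1).
rng = random.Random(0)
choices = []
for _ in range(2000):
    degrees = [rng.randint(1, 50) for _ in range(10)]
    chosen = rng.choices(range(10), weights=degrees)[0]
    choices.append((chosen, degrees))

alpha_hat, std_err = fit_alpha(choices)
```

On data simulated this way, `alpha_hat` should land near 1, and `alpha_hat ± 1.96 * std_err` gives an approximate 95% confidence interval.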
Mixture models and expectation-maximization. For mixed conditional logit models, the optimization problem is no longer convex in general, making optimization more difficult. To maximize the likelihood of mixed models we turn to expectation-maximization (EM) techniques (Dempster et al., 1977; Train, 2008). We briefly summarize the procedure described in Train’s book (Train, 2009, Chapter 14.3.2). Assume that we have a model with $M$ modes (i.e., mixture components), where every mode $m$ starts with initial parameter values $\theta_m$ (usually initiated at 1). Choices are again indexed with $i$, so that the $i$-th data point is $(j_i, C_i)$ for $i = 1, \ldots, n$. The EM algorithm runs through the following steps:
1. Initiate class probabilities uniformly with $\pi_m = 1/M$ and initial class responsibilities $\gamma_{i,m}$ for each data point.
2. For every data point $i$ and every mode $m$, compute the class responsibility given by the relative individual likelihood:
$$\gamma_{i,m} = \frac{\pi_m P_i(j_i, C_i; \theta_m)}{\sum_{m'=1}^{M} \pi_{m'} P_i(j_i, C_i; \theta_{m'})},$$
where $P_i(j, C; \theta_m)$ is the conditional logit probability of Equation (1) under the parameters $\theta_m$.
3. For every mode $m$, update the total class probability with $\pi_m = \frac{1}{n} \sum_{i=1}^{n} \gamma_{i,m}$.
4. For every mode $m$, update the parameters $\theta_m$ using standard optimization for fitting a single model, weighting each choice with its class responsibility $\gamma_{i,m}$.
5. Repeat steps 2–4 until some convergence or stopping criterion is met.
The total log-likelihood of the parameters and class probabilities is:
$$\ell(\theta, \pi; \mathcal{D}) = \sum_{i=1}^{n} \log \sum_{m=1}^{M} \pi_m P_i(j_i, C_i; \theta_m).$$
We monitor the convergence of the iterative procedure using the change in this total log-likelihood between iterations.
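The EM steps above can be sketched for a two-mode mixture of a log-degree logit and uniform attachment. This is a simplified illustration under stated assumptions, not the paper's implementation: the uniform mode has no parameters, the parameter update for the degree mode uses a clamped Newton iteration on the responsibility-weighted log-likelihood, and the function names, data layout `(chosen_index, log_degrees)`, and simulation settings are hypothetical.

```python
import math
import random


def pa_prob(alpha, logd, chosen):
    """Probability of the chosen node under the log-degree logit mode."""
    scores = [alpha * v for v in logd]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return weights[chosen] / sum(weights)


def weighted_alpha_update(data, resp, alpha, max_iter=25):
    """Step 4 for the degree mode: Newton ascent on the weighted log-likelihood."""
    for _ in range(max_iter):
        grad, hess = 0.0, 0.0
        for (chosen, logd), w in zip(data, resp):
            scores = [alpha * v for v in logd]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            mean = sum(x * v for x, v in zip(weights, logd)) / total
            var = sum(x * (v - mean) ** 2 for x, v in zip(weights, logd)) / total
            grad += w * (logd[chosen] - mean)
            hess -= w * var
        if hess == 0.0:
            break
        step = max(-1.0, min(1.0, grad / hess))  # clamp the Newton step for stability
        alpha -= step
        if abs(step) < 1e-9:
            break
    return alpha


def em_two_mode(data, n_iter=30):
    """EM for a mixture of a log-degree logit mode and a uniform mode."""
    pi_pa, alpha = 0.5, 1.0  # step 1: uniform class probabilities, alpha initiated at 1
    history = []
    for _ in range(n_iter):
        resp, loglik = [], 0.0
        for chosen, logd in data:  # step 2: class responsibilities
            lik_pa = pi_pa * pa_prob(alpha, logd, chosen)
            lik_un = (1.0 - pi_pa) / len(logd)
            loglik += math.log(lik_pa + lik_un)
            resp.append(lik_pa / (lik_pa + lik_un))
        history.append(loglik)  # total log-likelihood at the current parameters
        pi_pa = sum(resp) / len(resp)                     # step 3: class probability
        alpha = weighted_alpha_update(data, resp, alpha)  # step 4: mode parameters
    return alpha, pi_pa, history


# Synthetic data (hypothetical setup): degree-proportional choice with
# probability 0.6, uniform choice otherwise, over choice sets of 8 nodes.
rng = random.Random(1)
data = []
for _ in range(400):
    degrees = [rng.randint(1, 50) for _ in range(8)]
    if rng.random() < 0.6:
        chosen = rng.choices(range(8), weights=degrees)[0]
    else:
        chosen = rng.randrange(8)
    data.append((chosen, [math.log(d) for d in degrees]))

alpha_hat, pi_hat, history = em_two_mode(data)
```

The `history` list records the total log-likelihood at each iteration, which is the quantity monitored for convergence; a fixed iteration count is used here only to keep the sketch short.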
Even though EM is theoretically an efficient estimator (Xu and Jordan, 1996), there are cases when alternatives are appropriate. For example, if one has reasonable bounds or priors on the parameter values, then direct likelihood maximization could be used, and if the search space is low-dimensional, a grid search might be appropriate. Recent theoretical work has also developed algorithms for learning mixtures of two multinomial logit modes with theoretical guarantees assuming a separation between the modes (Chierichetti et al., 2018).
Negative sampling. Every time an edge is formed by some node $i$, each node not yet connected to $i$ is a candidate choice. For large sparse graphs, the full choice set of all nodes can become large and the gradients of the log-likelihood expensive to compute. To speed up this computation, negative/non-chosen examples can be sampled uniformly at random to create a (random) reduced dataset with smaller choice sets. For each choice $(j, C)$, one forms a smaller random choice set $\tilde{C}$ out of the positive choice $j$ and $s$ negative samples, with $\tilde{C} \subseteq C$, and replaces the original choice data with $(j, \tilde{C})$. As long as negative examples are sampled uniformly at random, parameter estimates on a dataset with negatively sampled choice sets are unbiased and consistent for the estimates on the full set (McFadden, 1978; Train, 2009; Jarvis, 2018). Practically, there is a trade-off between feature computation and storage on the one hand, and the ability to estimate coefficients for rare features on the other.
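Uniform negative sampling of a single choice set can be sketched in a few lines. The function name `negative_sample`, the candidate ids, and the sample size are assumptions made for illustration.

```python
import random


def negative_sample(chosen, choice_set, s, rng):
    """Reduce a choice set to the chosen node plus s uniformly sampled non-chosen nodes."""
    negatives = [k for k in choice_set if k != chosen]
    sampled = rng.sample(negatives, min(s, len(negatives)))  # without replacement
    return [chosen] + sampled


rng = random.Random(0)
full_set = list(range(1000))  # hypothetical candidate node ids
reduced = negative_sample(chosen=42, choice_set=full_set, s=5, rng=rng)
```

The chosen node is always kept, so every reduced choice set still contains the observed choice and the logit likelihood remains well defined.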
Typical likelihood surface. In Figure 1 we show the representative likelihood surface of a copy model to illustrate its properties. We generated a synthetic graph on nodes according to the copy model with edges per node and degree-attachment probability . We fit a two-mode mixed logit model to this data with and . We use negative samples. There are two free parameters in this model: the degree exponent $\alpha$ and the mixture probability $p$. We plot the log-likelihood across a reasonable range of values to show that the surface is generally well behaved. We see that it is hard to distinguish between data generated under a copy model () with probability from data generated from no-mixture () preferential attachment with , and there is a general trade-off between the exponent $\alpha$ and the mixture probability $p$.
Model comparison and the likelihood-ratio test. Another advantage of our discrete choice framework is that we can employ standard statistical methods for model selection. Specifically, when one model is a special case of another, their relative quality can be compared using the likelihood-ratio test. In the case of the conditional logit, a model with additional features can be compared to one without them because the latter is a special case of the former with the coefficients of the additional features being set to 0. Or, in the case of the mixed logit, one can define a model with multiple modes and manually set the class probabilities of some of them to zero.
As a concrete example, suppose we wanted to know whether including the age of a node in a preferential attachment model results in a statistically significantly better model. To do so, we would first estimate the parameters of the more complex model, $u_{i,j} = \theta_1 \log d_j + \theta_2\,\mathrm{age}_j$. We would then estimate the parameters of the simpler model, $u_{i,j} = \theta_1 \log d_j$. Let $L_c$ and $L_s$ be the likelihoods of the two models with parameters $\hat{\theta}_c$ and $\hat{\theta}_s$. We can compute the likelihood ratio statistic $\Lambda = -2 \log (L_s / L_c)$. Under the null hypothesis of the simpler model, with some regularity conditions, $\Lambda$ is asymptotically $\chi^2_1$ distributed (more generally, $\chi^2_\nu$, where $\nu$ is the number of additional degrees of freedom in the more complex model) (Wilks, 1938), and this is the standard test in the finite regime (Train, 2009, Chapter 3.8.2).
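For a single additional parameter, the likelihood-ratio test needs only the two maximized log-likelihoods. The sketch below is an illustration with made-up log-likelihood values; it uses the fact that the survival function of a chi-squared distribution with one degree of freedom is `erfc(sqrt(x / 2))`, so no external statistics library is required. The function name is hypothetical.

```python
import math


def likelihood_ratio_test(loglik_simple, loglik_complex):
    """Likelihood-ratio statistic and p-value for ONE additional parameter.

    Uses the chi-squared survival function with one degree of freedom,
    which equals erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (loglik_complex - loglik_simple)
    stat = max(stat, 0.0)  # for nested models the complex model cannot fit worse
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical maximized log-likelihoods, for illustration only.
stat, p_value = likelihood_ratio_test(loglik_simple=-1250.0, loglik_complex=-1245.0)
```

With more than one additional parameter, the same statistic would be compared against a chi-squared distribution with the corresponding number of degrees of freedom, which requires a general chi-squared survival function.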
4. Applications
We now demonstrate how to use our conditional logit framework to analyze network formation processes. We first consider synthetic data and show how our tools can be used to better analyze preferential attachment mechanisms. We then analyze two empirical datasets that demonstrate how to integrate different structural features of the network or integrate node covariates. In both cases, our framework provides novel insights into the network formation processes. We provide code for processing data (converting ordered edge lists to choice data) and for model fitting (with negative sampling), available here: https://github.com/janovergoor/choose2grow/.
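To make the data-processing step concrete, the sketch below converts a time-ordered list of directed edges into choice sets with negative sampling. It is not taken from the repository linked above; the function name `edges_to_choice_data`, the record layout, and the rule that a node becomes available once it first appears in the edge list are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def edges_to_choice_data(edges, n_negatives, seed=0):
    """Turn a time-ordered list of directed edges (i, j) into choice sets.

    Each record is (chooser, alternatives), where alternatives is a list of
    (node, in_degree_at_choice_time, chosen_flag) tuples: the chosen node
    followed by up to n_negatives sampled non-chosen nodes.
    """
    rng = random.Random(seed)
    in_degree = defaultdict(int)
    out_neighbors = defaultdict(set)
    nodes, seen = [], set()      # nodes in order of first appearance
    choices = []
    for i, j in edges:
        for v in (i, j):
            if v not in seen:
                seen.add(v)
                nodes.append(v)
        # Eligible non-chosen alternatives: known nodes that i does not
        # already link to (a linear scan, kept simple for illustration).
        eligible = [v for v in nodes
                    if v != i and v != j and v not in out_neighbors[i]]
        negatives = rng.sample(eligible, min(n_negatives, len(eligible)))
        alternatives = [(j, in_degree[j], 1)]
        alternatives += [(v, in_degree[v], 0) for v in negatives]
        choices.append((i, alternatives))
        # Update the graph only after the features have been recorded,
        # so features describe the network as it was when the choice was made.
        out_neighbors[i].add(j)
        in_degree[j] += 1
    return choices
```

Recording features before applying the edge matters: the chosen node's degree must not already include the edge being modeled.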
4.1. Measuring preferential attachment
The question of whether and when preferential attachment is an important driver of network formation is widely debated (Albert and Barabási, 1999; Callaway et al., 2001; Newman, 2001; Caldarelli et al., 2002; Vázquez, 2003; Saramäki and Kaski, 2004; Jackson and Rogers, 2007; Bagrow and Brockmann, 2013; Broido and Clauset, 2018). Most prior research focuses on estimating the shape of the attachment kernel by comparing the degree of chosen nodes to the distribution of available degrees (Newman, 2001; Jeong et al., 2003; Redner, 2005). However, recent work by Pham et al. shows that previous measures are biased (Pham et al., 2015). In particular, the bias comes from the assumption that the distribution of available nodes of varying degrees is constant throughout the formation process, but this distribution clearly changes as the network grows.
To estimate the exponent of an attachment kernel, Pham et al. propose fitting something akin to a conditional logit with a separate coefficient for each degree, and then estimating the exponent via a weighted least squares fit over the degree coefficients (Pham et al., 2015). Compared to this method, fitting a log-degree logit directly is much simpler. In fact, it is the maximum likelihood estimator for the exponent, and thus consistent and efficient.
To illustrate, we generate a graph with pure preferential attachment (, edges formed per node, ) and estimate the attachment kernel by the methods of Newman (Newman, 2001) and Pham et al. (Pham et al., 2015). The maximum degree of this graph was 102, and the results of the different estimation procedures are shown in Figure 2. The nonparametric estimates are similar for lower degrees, but for higher degrees the Newman measure incorrectly drops, illustrating the bias that Pham et al. have previously documented. Fitting directly using a log-degree conditional logit gives an estimate of . The Pham et al. least squares fit is , which is close to the MLE, but may deviate considerably in more difficult instances.
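As an illustration of why the direct fit is simple, the following sketch estimates the exponent of a log-degree conditional logit with Newton's method. It is a minimal sketch under stated assumptions, not the authors' code: the function name `fit_log_degree_logit` and the input layout are hypothetical, every alternative is assumed to have positive degree, and the model has a single feature (log degree).

```python
import math


def fit_log_degree_logit(choice_sets, n_iter=25):
    """Maximum likelihood estimate of the exponent in P(j) ∝ degree_j ** alpha.

    choice_sets: list of (chosen_index, degrees), all degrees positive.
    Uses Newton's method on the concave one-parameter log-likelihood.
    """
    alpha = 0.0
    for _ in range(n_iter):
        grad, hess = 0.0, 0.0
        for chosen, degrees in choice_sets:
            x = [math.log(d) for d in degrees]
            m = max(alpha * v for v in x)          # stabilize the softmax
            w = [math.exp(alpha * v - m) for v in x]
            total = sum(w)
            mean = sum(wi * v for wi, v in zip(w, x)) / total
            second = sum(wi * v * v for wi, v in zip(w, x)) / total
            grad += x[chosen] - mean               # observed minus expected log-degree
            hess -= second - mean * mean           # minus the model variance
        if hess == 0.0:                            # no variation: exponent not identified
            break
        alpha -= grad / hess
    return alpha
```

The gradient has a simple reading: at the optimum, the average log-degree of the chosen nodes equals the average log-degree the model expects given each choice set, which is why a changing distribution of available degrees is handled automatically.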
4.2. Disentangling preferential attachment from triadic closure
Many models exhibit outcomes similar to those of preferential attachment (Krapivsky et al., 2000; Caldarelli et al., 2002; Vázquez, 2003; Mitzenmacher, 2003; Jackson and Rogers, 2007), but there are few principled ways to rigorously test the relative validity of these models. In this section, we show how to use the discrete choice framework to estimate the relative importance of preferential attachment while accounting for other dynamics. To this end, we generate data according to a known generative process and fit various (possibly misspecified) formation models. Our generative process is a hybrid between the copy model of preferential attachment (i.e., choose nodes proportional to degree) and the Jackson-Rogers local search model (i.e., connecting to friends-of-friends). The process, which we call the model, is parametrized by and . When a new edge is formed, with probability it is formed uniformly at random and with probability it is formed with linear preferential attachment (). Meanwhile, the choice set is determined by the second parameter : with probability , the choice set is all nodes not yet connected to , while with probability , the choice set is limited to available friends-of-friends of . With , this model reduces to the copy model. With , it reduces to the simplified local search model. Our model thus naturally subsumes two popular models of network formation in a single, simple discrete choice framework. For a growth process on directed graphs, it is necessary that and , otherwise new nodes will never be selected.
With this general model, we investigate how estimating parameters of one of the more specific models goes awry when the true data generating process in fact comes from an instance of the more general model. For a range of values of and , we generated graphs using the following growth process. New nodes arrive, each creating edges. For every edge, we sample the mode of the model (according to and ) independently. If an edge is supposed to be a friend-of-friend edge, but no friends-of-friends are available (for example, ’s first edge), then the process reverts to uniformly random formation across the full node set [4]. Sweeping through combinations of and parameter values, for each set of parameters we generated 10 undirected graphs with nodes each.
[4] This creates a slight bias towards uniform at random modes. This reversion to uniform attachment happens for every first edge with probability .
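The growth process just described can be sketched as follows. This is an illustrative reading of the text rather than the authors' generator: the parameter names `p_uniform` (probability that the target is drawn uniformly rather than proportionally to degree) and `p_all_nodes` (probability that the choice set is all unconnected nodes rather than friends-of-friends) are hypothetical, as is the choice of a small clique as the seed graph.

```python
import random


def grow_graph(n_nodes, m, p_uniform, p_all_nodes, seed=0):
    """Grow an undirected graph; returns adjacency as a dict node -> set."""
    rng = random.Random(seed)
    # Assumed seed graph: a clique on m + 1 nodes, so early arrivals
    # always have at least m available targets.
    adj = {v: set(range(m + 1)) - {v} for v in range(m + 1)}
    for i in range(m + 1, n_nodes):
        adj[i] = set()
        for _ in range(m):
            everyone = [v for v in adj if v != i and v not in adj[i]]
            fof = sorted({w for u in adj[i] for w in adj[u]} - adj[i] - {i})
            # Choice set: all unconnected nodes, or friends-of-friends only.
            # With no friends-of-friends available (e.g. the first edge),
            # revert to uniform choice over all unconnected nodes.
            if rng.random() < p_all_nodes:
                candidates, force_uniform = everyone, False
            elif fof:
                candidates, force_uniform = fof, False
            else:
                candidates, force_uniform = everyone, True
            if force_uniform or rng.random() < p_uniform:
                j = rng.choice(candidates)
            else:
                # Linear preferential attachment within the choice set.
                weights = [len(adj[v]) for v in candidates]
                j = rng.choices(candidates, weights=weights)[0]
            adj[i].add(j)
            adj[j].add(i)
    return adj
```

Because the graph is undirected, every existing node has positive degree, so the degree-proportional draw is always well defined here; the directed case needs the parameter restrictions noted in the text.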
Degree distributions. The local search and copy models both produce graphs with power-law degree distributions. Therefore, fitting a misspecified model on a degree distribution can lead to misleading results. To illustrate, we fit a power-law distribution to the degree distribution of graphs generated from models using maximum likelihood estimation (Clauset et al., 2009), with estimates for in Figure 3. In theory, an undirected graph formed with the copy model process with probability parameter leads to a degree distribution with power-law exponent (Mitzenmacher, 2003; Bollobás and Riordan, 2003) (for directed graphs, ). As increases, the degree distribution looks more like a random graph without preferential attachment. However, as goes down (increasing the relative role of friends-of-friends), the parameter estimate looks like the estimates for the copy model, even when .
To summarize, it is not recommended to estimate a formation model from an observed degree distribution. The parameter estimates are sensitive to small deviations in the generative process.
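For reference, the maximum likelihood power-law fit mentioned above has a closed form in the continuous case, with a commonly used approximation for integer data such as degrees (Clauset et al., 2009). The sketch below is illustrative only: the function name and arguments are hypothetical, the lower cutoff `x_min` is taken as given rather than estimated, and the discrete approximation is only expected to be accurate when `x_min` is not too small.

```python
import math


def power_law_exponent_mle(values, x_min=1.0, discrete=False):
    """Estimate the exponent of p(x) ∝ x ** (-exponent) for x >= x_min.

    Continuous data: exponent = 1 + n / sum(log(x / x_min)).
    Discrete data (approximation): replace x_min by x_min - 1/2 in the ratio.
    """
    tail = [x for x in values if x >= x_min]
    shift = 0.5 if discrete else 0.0
    log_sum = sum(math.log(x / (x_min - shift)) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need tail observations strictly above the cutoff")
    return 1.0 + len(tail) / log_sum
```

The estimator uses only the degree sequence, which is exactly why it cannot distinguish between formation processes that produce similar degree distributions.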
Discrete choice modeling. Beyond degree distributions, in Figure 4 we look at how the two subsumed models (the copy model and the Jackson-Rogers local search model) fare when estimated from formation data generated by the model. We look at two cases. As a first case, we generate graphs with and , so half the edges are formed to friends-of-friends with no utility from degree. The likelihood under a local search model ( free, ) as a mixed logit is maximized at , while for the copy model (, free) it is maximized at . The former is a much better fit than the latter (p-value < ), and the copy model erroneously attributes 45% of the edges to preferential attachment. As a second case, we look at a graph generated with and , so half the edges are due to preferential attachment, and friend-of-friend formation plays no role. In this case, both models are correctly maximized at their relative values. Again, the correct model has a higher likelihood (p-value < ).
4.3. Choosing to follow on Flickr
We now apply our framework to examine a real-world network formation dataset capturing the growth of the Flickr social network. We find that incorporating a Friend-of-Friend feature beyond preferential attachment and link-reciprocation features substantially improves both likelihood and test accuracy and furthermore that the inclusion of this feature significantly reduces preference for degree-based attachment. However, omitting preferential attachment entirely leads to a worse model. We also find a preference for nodes with zero degree over low-degree nodes. This hints that such nodes play a special role in the network formation process, even though they would be ignored in preferential attachment models.
Data. We use a scrape of the Flickr social network collected daily between October 2006 and May 2007 (Mislove et al., 2007, 2008). Users of Flickr can choose to follow other users and the “following” (but not the “followed by”) connections are publicly accessible. The data was gathered using a breadth-first search crawl, which means that only the connected components reachable from the seed profiles are represented in the data. Since a full crawl was performed daily, the timing of new edges can be identified at the granularity of a day. The graph contains 3.2 million nodes and 33.1 million edges.
As described in the original papers, this data is consistent with both preferential attachment, as inferred from the in-degree distribution, and local search, as inferred from the overrepresentation of edges to nodes that are close to the linking node (Mislove et al., 2008). Fitting a power law to the distribution of in-degrees gives , which would indicate superlinear preferential attachment. We can test the relative importance of triadic closure by fitting a Jackson-Rogers model using the degree distribution matching procedure described by Jackson and Rogers (2007). This results in , estimating that three out of four edges are formed through triadic closure.
Table 3. Conditional logit models fit to the Flickr data.
| | Model #1 | Model #2 | Model #3 | Model #4 |
|---|---|---|---|---|
| log Followers | 1.149*** (0.007) | | 0.715*** (0.009) | 0.536*** (0.010) |
| Has degree | 0.580*** (0.202) | | 0.631*** (0.190) | 1.745*** (0.234) |
| Reciprocal | 8.419*** (0.220) | 8.347*** (0.220) | 8.197*** (0.240) | 7.903*** (0.244) |
| Is FoF | | 6.125*** (0.045) | 3.955*** (0.050) | |
| 2 Hops | | | | 6.290*** (0.190) |
| 3 Hops | | | | 2.851*** (0.185) |
| 4 Hops | | | | 0.583*** (0.189) |
| 5 Hops | | | | 0.585*** (0.218) |
| 6 Hops | | | | 1.122*** (0.266) |
| Observations | 20,000 | 20,000 | 20,000 | 20,000 |
| Log-likelihood | 16,448 | 14,685 | 10,728 | 9,789 |
| Test accuracy | 0.758 | 0.722 | 0.853 | 0.855 |
Note: *p<0.10; **p<0.05; ***p<0.01
Discrete choice analysis. We fit a series of conditional logit models to further investigate the network formation process. We isolated a sample of 20,000 edge formation events occurring around the same date [5] to avoid time heterogeneity affecting the estimates. We fit several models, displayed in Table 3. Non-chosen alternatives are negatively sampled with . We log-transform in-degree (representing the number of followers), but in order to account for nodes with degree zero, we add a “has degree” feature for having a positive degree and use a modified version of the logarithm that returns 0 for input 0 [6]. In the first column, we fit a model using just these two degree-related features, and a reciprocity feature capturing whether the target node is already following the chooser. Reciprocity is a common phenomenon, with 60% of edges being followed back (Mislove et al., 2008). The estimate (the coefficient for “log Followers”) for this model is significantly larger than 1, again consistent with superlinear preferential attachment.
[5] We enumerated edges starting November 5, 2006 and included new edges with probability 0.01 until reaching the desired sample size. We excluded edge events originating from nodes seen for the first time in a given day (the timing of these edge events is uncertain due to the original data collection process). The same analysis starting on March 3, 2007 led to virtually identical results.
[6] This is a better solution than using , or only giving degree-zero nodes the same utility as degree-one nodes. Either of those solutions will give substantially different results, especially when there are many degree-zero nodes.
In the second model, we test the effect of the target node being a friend-of-friend of the choosing node. In the case of Flickr, this means that the choosing user already follows someone that follows the target user, which evidently is strongly correlated with following that user. However, combining these two features in a third model (column 3) leads to both estimated parameters dropping substantially. Most remarkable is the 40% drop in the estimate of , which paints a very different picture about the role of degree.
In the fourth model, we measure network proximity as in the original paper, by counting the number of “hops” (path length) from to the target before an edge was made. We integrate the hops as categorical variables to show the relative impact of each additional “hop”. Being two hops away is equivalent to being a friend-of-friend, and thus has a strongly positive coefficient. Every additional hop corresponds to a sharp decrease in the likelihood of choosing that node. Being five hops away is slightly worse than there not being a path at all. This could be an artifact of the way the data was gathered, so that new regions of the graph only get “discovered” when there is at least one link to them, or this could be due to path length not being an accurate measure of distance for newer nodes. Since the number of hops is collinear with being a friend-of-friend, we cannot test them both at the same time.
In Figure 5 we visually show the effect of different specifications on the estimate of . The first model of the Flickr data looks like superlinear preferential attachment, while the role of degree in the other two is significantly reduced. However, fitting a nonparametric model shows that the estimated coefficients for individual degrees are remarkably linear, suggesting that the functional form of is a good fit for this network. One important point is the role of zero-degree nodes. In most descriptions of preferential attachment, nodes with degree zero are not considered. However, in the Flickr dataset, zero-degree nodes have a higher utility than positive low-degree nodes, which could again be an artifact of the data collection process, or point to the special role of new nodes in the network. Either way, our framework allows one to find these kinds of patterns and investigate them further.
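The degree encoding used in the Flickr models (a “has degree” indicator together with a logarithm defined as 0 at degree zero) can be sketched as below. The function names and the two-coefficient utility are hypothetical and only illustrate how the encoding lets zero-degree and degree-one nodes receive different utilities.

```python
import math


def degree_features(in_degree: int):
    """Return (has_degree, log_degree), with log_degree defined as 0 at degree 0."""
    if in_degree < 0:
        raise ValueError("degree must be non-negative")
    has_degree = 1 if in_degree > 0 else 0
    log_degree = math.log(in_degree) if in_degree > 0 else 0.0
    return has_degree, log_degree


def degree_utility(in_degree, coef_log_degree, coef_has_degree):
    """Degree part of the utility in a conditional logit (illustrative)."""
    has_degree, log_degree = degree_features(in_degree)
    return coef_has_degree * has_degree + coef_log_degree * log_degree
```

Under this encoding a zero-degree node has degree utility 0 and a degree-one node has utility equal to the “has degree” coefficient, so the sign of that coefficient directly states which of the two is preferred.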
4.4. Choosing to cite
We now turn to citation network data to show how a discrete choice framework facilitates the testing of network formation hypotheses. Previous analyses of citation networks have observed linear preferential attachment with respect to degree (Redner, 2005) and bias towards citing more recent work (Redner, 2005). Here, we find consistent results that older papers are less likely to be cited but that accounting for age actually increases the importance of degree (i.e., after accounting for age, higher degree nodes are more likely to be cited).
Data. We use the Microsoft Academic Graph dataset [7] and focus on a representative subgraph of 459,000 “Climatology” papers. We focus on the subgraph of a single field to simplify the analysis since citations are predominantly within the same field of study (our analysis was similar on other subgraphs). We construct a graph out of this data by adding an edge each time a paper in our dataset cites another paper in our dataset. For our analysis of Climatology publications, 45% of edges are within the domain and citations to papers that are not labeled are excluded, leaving 3 million edges. We sample 10,000 citation events uniformly at random from papers published after 2010 and apply negative sampling (). This processing results in 10,000 choices with 25 alternatives in each choice set. For each possible choice, we compute four features: the number of citations at the time of citation, whether the paper shares authors with the citing paper, the age of the paper in years at the time of citation, and the maximum number of publications by any one of the authors at the time of publication. This last feature is a proxy for node fitness (Caldarelli et al., 2002).
[7] The Aminer Project (Tang et al., 2008; Sinha et al., 2015), https://aminer.org/openacademicgraph
Table 4. Conditional logit models fit to the citation data.
| | Model #1 | Model #2 | Model #3 | Model #4 |
|---|---|---|---|---|
| log Citations | 0.717*** (0.008) | 0.794*** (0.010) | 1.052*** (0.012) | 1.044*** (0.012) |
| Has degree | 1.684*** (0.053) | 1.677*** (0.062) | 1.862*** (0.063) | 1.830*** (0.064) |
| Has same author | | 6.523*** (0.110) | 5.928*** (0.114) | 5.913*** (0.114) |
| log Age | | | 1.096*** (0.018) | 1.069*** (0.021) |
| Max papers by author | | | | 0.029*** (0.011) |
| Observations | 10,000 | 10,000 | 10,000 | 10,000 |
| Log-likelihood | 20,799 | 16,600 | 14,384 | 14,390 |
| Test accuracy | 0.358 | 0.484 | 0.533 | 0.534 |
Note: *p<0.1; **p<0.05; ***p<0.01
Discrete choice analysis. We fit conditional logit choice models relating these features to the likelihood of citation (Table 4). The first model (first column) is a simple log-degree model. We find that the estimate (the coefficient for “log Citations”) is substantially lower than one, consistent with sublinear preferential attachment. Apart from the log-likelihood of the models, we also report the predictive accuracy (defined as the share of instances predicted correctly) on a held-out test set of 2,000 examples. Just relying on prior degree already gives an accuracy of 36%, which is high for a classification task with 25 classes, where uniform guessing would be correct 4% of the time. In model two (second column), we add a covariate for whether a paper shares an author with the citing paper. As expected, this has a strongly positive coefficient.
For the third model we add a covariate for the age of the paper in log years (years is always at least one). Older papers are less likely to get cited (accounting for degree), but accounting for age increases the relative importance of degree significantly. This expanded model also increases the accuracy to 53%, indicating that these features capture substantially more predictive power. Finally, in model four we add the “max papers by author” feature as a proxy for fitness. The coefficient is small but positive. Accounting for fitness slightly reduces the estimated relative importance of degree, but the estimate is still close to 1. Adding this feature does not improve the log-likelihood or predictive accuracy; a better proxy for fitness may explain the data better.
Looking back to the visual display of for the citation models in Figure 5, the nonparametric coefficients are highly linear. In this data, zero-degree nodes are significantly less attractive than nodes with degree one.
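The predictive accuracy reported above, the share of held-out choices in which the alternative with the highest estimated utility is the one actually chosen, can be computed as in the sketch below. The function name and data layout are hypothetical; ties are broken in favor of the first alternative, which is an arbitrary choice.

```python
def predictive_accuracy(choice_sets, coefficients):
    """Share of choices whose highest-utility alternative was the chosen one.

    choice_sets: list of (chosen_index, feature_rows), where each feature
    row lists one alternative's features in the same order as coefficients.
    """
    correct = 0
    for chosen, rows in choice_sets:
        utilities = [sum(c * x for c, x in zip(coefficients, row))
                     for row in rows]
        # Softmax is monotone in utility, so the most probable alternative
        # under a conditional logit is simply the one with maximal utility.
        predicted = max(range(len(rows)), key=utilities.__getitem__)
        correct += int(predicted == chosen)
    return correct / len(choice_sets)
```

With 25 alternatives per choice set, a model with all coefficients equal to zero carries no information and uniform guessing would be right 1/25 of the time, which is the baseline against which the reported accuracies should be read.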
As with any regression, whether causal effects can be identified from model fit depends on the design of the study. The estimates we provide here, as is the case with most analyses of observational data, are descriptive and not meant to describe causal processes. The point is that discrete choice models provide a flexible framework to easily test and compare different hypotheses around network formation.
5. Discussion
When modeling network formation, the majority of the literature analyzes networks that grow “externally,” with new nodes arriving and choosing who to connect to, and this setting has also been our main focus here. External growth leads to convenient models that are relatively easy to analyze, with citation networks and patent networks as examples of empirical networks that follow this generative process reasonably closely. However, in many (especially social) networks, pairs of older nodes often form edges as well, edges that are “internal” to the existing set of nodes. An extreme example is the social networks of schools or classrooms, which have a fixed node population and “grow” purely through an internal growth process. But a major advantage of modeling network formation as discrete choice is that it does not require any model of edge event initiation and simply conditions on the sequence of decisions to initiate, focusing the modeling on the choices made by the initiator. Discrete choice can therefore easily be used to model internal growth as well.
Another major advantage of discrete choice modeling is that it connects the analysis of large-scale network datasets to statistical methods (fitting generalized linear models) that are tremendously scalable. As we show in this work, additional techniques (e.g., negative sampling) make it possible to efficiently scale the estimation process to very large network datasets.
Since the conditional logit model of discrete choice is a random utility model, the estimated parameters can be interpreted as the marginal utility of each feature. This allows one to question the functional form of features. For example, we show that preferential attachment is equivalent to the logarithmic utility of degree. Given that degree is commonly heavy-tailed, this is a natural functional form, but we point out that the conditional logit allows one to flexibly compare different specifications.
The discrete choice perspective of network formation has implications for how network data is best collected and analyzed. It is useful to consider and record notions of directionality, even if edges can otherwise be considered to be undirected. Datasets that record the exact time of all edge formation events, as opposed to lumping edge events at the granularity of days or years, also make it possible to analyze the formation process in more detail. It is also useful to record information about the choice sets: what affects whom one could connect to, and what did each node look like at the time the choice was made?
There are a couple of limitations to our proposed methodology. First, we cannot model purely undirected edges without any notion of direction. This can make it challenging to perform analysis when data collection is limited, as described above. Second, even though the conditional logit and mixed logit models allow one to model similar mechanisms, the interpretations of their estimates are different. The estimates of a conditional logit are more akin to those of a linear regression model, where one estimates the expected change in an outcome from varying a covariate. A mixture model is a probabilistic combination of constituent modes, so the class probabilities indicate the relative importance of each mode, which makes it harder to compare the roles of individual features within or across modes. However, many traditional models of network formation are equivalent to mixture models, which motivated our consideration of them in this work.
Additional related work. We have already covered a wide range of related work throughout the text. We include a few other important related ideas here. First, there is a strong connection to link prediction and missing data prediction, which use network features to predict edges (Liben-Nowell and Kleinberg, 2007; Clauset et al., 2008). A network formation model implicitly makes claims about what edges are most likely to form next, and thus can be evaluated by the same metrics as link prediction algorithms (Lü and Zhou, 2011). We used predictive accuracy as a measure of goodness of fit, but our primary concern was interpretability of the model and estimates, which is one of the advantages of the conditional logit model.
Second, a related line of research studies stochastic actor-oriented models (Snijders, 2001; Snijders et al., 2010). This type of modeling combines network dynamics with node features to develop Markov chains in the space of graphs. These models are difficult to estimate, especially for larger datasets, and we are not aware of techniques such as negative sampling that can improve scalability.
Finally, estimating the parameters that drive edge formation is different from identifying the factors that could have led to the observed graph. The latter question is often pursued with so-called exponential random graph models (ERGMs) (Wasserman and Pattison, 1996; Wiuf et al., 2006; Robins et al., 2007). However, these models do not consider individual edge events, are hard to estimate, and have known pathologies (Chatterjee and Diaconis, 2013; Shalizi and Rinaldo, 2013).
Looking forward. By making foundational connections between network formation modeling and discrete choice, we are hopeful that many further tools from discrete choice theory can be applied to the study of network formation. For example, there are cases of bias in network formation, e.g., men are more likely to cite themselves than women are (King et al., 2017), and our discrete choice framework can help study these cases more rigorously. As another example, discrete choice models of subset selection (Fishburn and LaValle, 1996; Benson et al., 2018) could be applied to understand possible substitution and complementarity effects when choosing network connections. And discrete choice interpretations of machine learning embedding techniques (Rudolph et al., 2016) can likely help unpack the behavior of recent embedding-based network representation methods such as DeepWalk (Perozzi et al., 2014). Networks fundamentally represent interactions between discrete entities, and it is therefore natural that methods for modeling and analyzing discrete choice should enable many contributions.
Acknowledgements.
We thank Aaron Clauset, Daniel Larremore, and Eduardo Laguna-Müggenburg for their helpful comments and feedback. ARB was supported by NSF Award DMS-1830274.
References
 Agresti (2003) Alan Agresti. 2003. Categorical data analysis. Vol. 482. John Wiley & Sons.
 Albert and Barabási (1999) Réka Albert and Albert-László Barabási. 1999. Emergence of scaling in random networks. Science 286, 5439 (1999), 509–512.
 Bagrow and Brockmann (2013) James P Bagrow and Dirk Brockmann. 2013. Natural Emergence of Clusters and Bursts in Network Evolution. Physical Review X 3, 2 (2013).
 Benson et al. (2016) Austin R Benson, Ravi Kumar, and Andrew Tomkins. 2016. On the relevance of irrelevant alternatives. In WWW. ACM, 963–973.
 Benson et al. (2018) Austin R Benson, Ravi Kumar, and Andrew Tomkins. 2018. A Discrete Choice Model for Subset Selection. In WSDM. ACM, 37–45.
 Bianconi and Barabási (2001) Ginestra Bianconi and AL Barabási. 2001. Competition and multiscaling in evolving networks. EPL (Europhysics Letters) 54, 4 (2001), 436.
 Bollobás and Riordan (2003) Béla Bollobás and Oliver M Riordan. 2003. Mathematical results on scale-free random graphs. Handbook of graphs and networks: from the genome to the internet (2003), 1–34.
 Broido and Clauset (2018) Anna D Broido and Aaron Clauset. 2018. Scale-free networks are rare. arXiv preprint arXiv:1801.03400 (2018).
 Caldarelli et al. (2002) Guido Caldarelli, Andrea Capocci, Paolo De Los Rios, and Miguel A Munoz. 2002. Scale-free networks from varying vertex intrinsic fitness. Physical Review Letters 89, 25 (2002), 258702.
 Callaway et al. (2001) Duncan S Callaway, John E Hopcroft, Jon M Kleinberg, Steven H Strogatz, and Mark E J Newman. 2001. Are randomly grown graphs really random? Physical Review E 64, 4 (2001).
 Chatterjee and Diaconis (2013) Sourav Chatterjee and Persi Diaconis. 2013. Estimating and understanding exponential random graph models. Annals of Statistics 41, 5 (2013), 2428–2461.
 Chierichetti et al. (2018) Flavio Chierichetti, Ravi Kumar, and Andrew Tomkins. 2018. Learning a mixture of two multinomial logits. In ICML. PMLR, 961–969.
 Clauset et al. (2008) Aaron Clauset, Cristopher Moore, and Mark E J Newman. 2008. Hierarchical structure and the prediction of missing links in networks. Nature 453, 7191 (2008), 98–101.
 Clauset et al. (2009) Aaron Clauset, Cosma R Shalizi, and Mark E J Newman. 2009. Power-Law Distributions in Empirical Data. SIAM Rev. 51, 4 (Dec. 2009), 661–703.
 Cooper and Frieze (2003) Colin Cooper and Alan Frieze. 2003. A general model of web graphs. Random Structures & Algorithms 22, 3 (2003), 311–335.
 Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (1977), 1–38.
 Easley and Kleinberg (2010) David Easley and Jon Kleinberg. 2010. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University Press.
 Fishburn and LaValle (1996) Peter C Fishburn and Irving H LaValle. 1996. Binary interactions and subset choice. European Journal of Operational Research 92, 1 (1996), 182–192.
 Fuller et al. (1982) Winship C Fuller, Charles F Manski, and David A Wise. 1982. New evidence on the economic determinants of postsecondary schooling choices. Journal of Human Resources (1982), 477–498.
 Hoff et al. (2002) Peter D Hoff, Adrian E Raftery, and Mark S Handcock. 2002. Latent space approaches to social network analysis. Journal of the american Statistical association 97, 460 (2002), 1090–1098.
 Holland et al. (1983) Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. 1983. Stochastic blockmodels: First steps. Social networks 5, 2 (1983), 109–137.
 Holme and Saramäki (2012) Petter Holme and Jari Saramäki. 2012. Temporal networks. Physics reports 519, 3 (2012), 97–125.
 Ieong et al. (2012) Samuel Ieong, Nina Mishra, and Or Sheffet. 2012. Predicting preference flips in commerce search. In ICML. PMLR, 1795–1802.
 Ikehara and Clauset (2017) Kansuke Ikehara and Aaron Clauset. 2017. Characterizing the structural diversity of complex networks across domains. arXiv preprint arXiv:1710.11304 (2017).
 Jackson and Rogers (2007) Matthew O Jackson and Brian W Rogers. 2007. Meeting Strangers and Friends of Friends: How Random Are Social Networks? American Economic Review 97, 3 (2007), 890–915.
 Jarvis (2018) Benjamin F Jarvis. 2018. Estimating Multinomial Logit Models with Samples of Alternatives. Sociological Methodology (2018).
 Jeong et al. (2003) H Jeong, Z Neda, and Albert-László Barabási. 2003. Measuring preferential attachment in evolving networks. Europhysics Letters (EPL) 61, 4 (2003), 567–572.
 Jin et al. (2001) Emily M Jin, Michelle Girvan, and Mark EJ Newman. 2001. Structure of growing social networks. Physical Review E 64, 4 (2001), 046132.
 Kamakura and Russell (1989) Wagner A Kamakura and Gary J Russell. 1989. A probabilistic choice model for market segmentation and elasticity structure. Journal of Marketing Research (1989), 379–390.
 Karrer and Newman (2011) Brian Karrer and Mark EJ Newman. 2011. Stochastic blockmodels and community structure in networks. Physical review E 83, 1 (2011), 016107.
 King et al. (2017) Molly M King, Carl T Bergstrom, Shelley J Correll, Jennifer Jacquet, and Jevin D West. 2017. Men set their own cites high: Gender and self-citation across fields and over time. Socius 3 (2017).
 Krapivsky et al. (2000) Paul L Krapivsky, Sidney Redner, and Francois Leyvraz. 2000. Connectivity of growing random networks. Physical review letters 85, 21 (2000), 4629.
 Krioukov et al. (2010) Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguná. 2010. Hyperbolic geometry of complex networks. Physical Review E 82, 3 (2010), 036106.
 Kumar et al. (2010) Ravi Kumar, Jasmine Novak, and Andrew Tomkins. 2010. Structure and evolution of online social networks. In Link mining: models, algorithms, and applications. Springer, 337–357.
 Kumar et al. (2000) Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D Sivakumar, Andrew Tomkins, and Eli Upfal. 2000. Stochastic models for the web graph. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science. IEEE, 57–65.
 Kumar et al. (1999) Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. 1999. Extracting large-scale knowledge bases from the web. VLDB (1999).
 Leskovec et al. (2008) Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. 2008. Microscopic evolution of social networks. In KDD. ACM, 462–470.
 Leskovec et al. (2007) Jure Leskovec, Jon M Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. Transactions on Knowledge Discovery from Data (TKDD) 1, 1 (2007).
 Liben-Nowell and Kleinberg (2007) David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58, 7 (2007), 1019–1031.
 Liu et al. (2002) Zonghua Liu, Ying-Cheng Lai, Nong Ye, and Partha Dasgupta. 2002. Connectivity distribution and attack tolerance of general networks with both preferential and random attachments. Physics Letters A 303, 5-6 (2002), 337–344.
 Lü and Zhou (2011) Linyuan Lü and Tao Zhou. 2011. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications 390, 6 (2011), 1150–1170.
 Luce (1959) R Duncan Luce. 1959. Individual Choice Behavior; a Theoretical Analysis. New York: Wiley.
 McFadden (1978) Daniel McFadden. 1978. Modeling the choice of residential location. Transportation Research Record 673 (1978).
 McFadden et al. (1973) Daniel McFadden et al. 1973. Conditional logit analysis of qualitative choice behavior. (1973).
 McPherson et al. (2001) Miller McPherson, Lynn Smith-Lovin, and James M Cook. 2001. Birds of a feather: Homophily in social networks. Annual review of sociology 27, 1 (2001), 415–444.
 Medina et al. (2018) Jan Medina, Jorge Finke, and Camilo Rocha. 2018. Estimating Formation Mechanisms and Degree Distributions in Mixed Attachment Networks. arXiv.org (2018). arXiv:math.PR/1809.03372v1
 Mislove et al. (2008) Alan Mislove, Hema Swetha Koppula, Krishna P Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2008. Growth of the Flickr Social Network. In Proceedings of the 1st SIGCOMM Workshop on Social Networks.
 Mislove et al. (2007) Alan Mislove, Massimiliano Marcon, Krishna P Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th SIGCOMM conference on Internet Measurement. ACM.
 Mitzenmacher (2003) Michael Mitzenmacher. 2003. A Brief History of Generative Models for Power Law and Lognormal Distributions. Internet Mathematics 1, 2 (2003), 226–251.
 Newman (2001) Mark E J Newman. 2001. Clustering and preferential attachment in growing networks. Physical Review E 64, 2 (2001), 025102.
 Papadopoulos et al. (2012) Fragkiskos Papadopoulos, Maksim Kitsak, M Ángeles Serrano, Marián Boguñá, and Dmitri Krioukov. 2012. Popularity versus similarity in growing networks. Nature 489, 7417 (2012), 537.
 Paranjape et al. (2017) Ashwin Paranjape, Austin R Benson, and Jure Leskovec. 2017. Motifs in temporal networks. In WSDM. ACM, 601–610.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In KDD. ACM, 701–710.
 Pham et al. (2015) Thong Pham, Paul Sheridan, and Hidetoshi Shimodaira. 2015. PAFit: A Statistical Method for Measuring Preferential Attachment in Temporal Complex Networks. PLoS ONE 10, 9 (2015).
 Price (1976) Derek de Solla Price. 1976. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science 27, 5 (1976), 292–306.
 Ragain and Ugander (2016) Stephen Ragain and Johan Ugander. 2016. Pairwise choice Markov chains. In NIPS. 3198–3206.
 Rapoport (1953) Anatol Rapoport. 1953. Spread of information through a population with socio-structural bias: I. Assumption of transitivity. The bulletin of mathematical biophysics 15, 4 (1953), 523–533.
 Redner (2005) Sidney Redner. 2005. Citation Statistics from 110 Years of Physical Review. Physics Today 58, 12 (2005).
 Robins et al. (2007) Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. 2007. An introduction to exponential random graph (p*) models for social networks. Social networks 29, 2 (2007), 173–191.
 Rudolph et al. (2016) Maja Rudolph, Francisco Ruiz, Stephan Mandt, and David Blei. 2016. Exponential family embeddings. In NIPS. 478–486.
 Saramäki and Kaski (2004) Jari Saramäki and Kimmo Kaski. 2004. Scale-free networks generated by random walkers. Physica A: Statistical Mechanics and its Applications 341 (2004), 80–86.
 Shalizi and Rinaldo (2013) Cosma R Shalizi and Alessandro Rinaldo. 2013. Consistency under sampling of exponential random graph models. Annals of Statistics 41, 2 (2013), 508–535.
 Simonson and Tversky (1992) Itamar Simonson and Amos Tversky. 1992. Choice in context: Tradeoff contrast and extremeness aversion. Journal of Marketing Research 29, 3 (1992), 281.
 Sinha et al. (2015) Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bojune Paul Hsu, and Kuansan Wang. 2015. An overview of microsoft academic service (mas) and applications. In WWW. ACM, 243–246.
 Snijders (2001) Tom AB Snijders. 2001. The Statistical Evaluation of Social Network Dynamics. Sociological Methodology 31, 1 (2001), 361–395.
 Snijders et al. (2010) Tom AB Snijders, Gerhard G Van de Bunt, and Christian EG Steglich. 2010. Introduction to stochastic actor-based models for network dynamics. Social networks 32, 1 (2010), 44–60.
 Tang et al. (2008) Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. Arnetminer: extraction and mining of academic social networks. In KDD. ACM, 990–998.
 Train (2008) Kenneth E Train. 2008. EM algorithms for nonparametric estimation of mixing distributions. Journal of Choice Modelling 1, 1 (2008), 40–69.
 Train (2009) Kenneth E Train. 2009. Discrete choice methods with simulation. Cambridge university press.
 Train and McFadden (1978) Kenneth E Train and Daniel McFadden. 1978. The goods/leisure tradeoff and disaggregate work trip mode choice models. Transportation research 12, 5 (1978), 349–353.
 Tversky (1972) Amos Tversky. 1972. Elimination by aspects: A theory of choice. Psychological Review 79, 4 (1972), 281.
 Ugander et al. (2011) Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. 2011. The anatomy of the facebook social graph. arXiv preprint arXiv:1111.4503 (2011).
 Vázquez (2003) Alexei Vázquez. 2003. Growing network with local rules: Preferential attachment, clustering hierarchy, and degree correlations. Physical Review E 67, 5 (2003).
 Wasserman and Pattison (1996) Stanley Wasserman and Philippa Pattison. 1996. Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika 61, 3 (1996), 401–425.
 Wilks (1938) Samuel S Wilks. 1938. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics 9, 1 (1938), 60–62.
 Wiuf et al. (2006) Carsten Wiuf, Markus Brameier, Oskar Hagberg, and Michael PH Stumpf. 2006. A likelihood approach to analysis of network data. PNAS 103, 20 (2006), 7566–7570.
 Xu and Jordan (1996) Lei Xu and Michael I Jordan. 1996. On Convergence Properties of the EM Algorithm for Gaussian Mixtures. Neural Computation 8, 1 (1996), 129–151.