Maximum likelihood estimation for social network dynamics

# Maximum likelihood estimation for social network dynamics

[ [    [ [    [ [ University of Oxford, University of Groningen, University of Oxford
and Penn State University
T. A. B. Snijders
J. Koskinen
Nuffield College
United Kingdom
M. Schweinberger
Department of Statistics
Penn State University
326 Thomas Building
University Park, Pennsylvania 16802
USA
\smonth3 \syear2009\smonth8 \syear2009
\smonth3 \syear2009\smonth8 \syear2009
\smonth3 \syear2009\smonth8 \syear2009
###### Abstract

A model for network panel data is discussed, based on the assumption that the observed data are discrete observations of acontinuous-time Markov process on the space of all directed graphs on a given node set, in which changes in tie variables are independent conditional on the current graph. The model for tie changes is parametric and designed for applications to social network analysis, where the network dynamics can be interpreted as being generated by choices made by the social actors represented by the nodes of the graph. An algorithm for calculating the Maximum Likelihood estimator is presented, based on data augmentation and stochastic approximation. An application to an evolving friendship network is given and a small simulation study is presented which suggests that for small data sets the Maximum Likelihood estimator is more efficient than the earlier proposed Method of Moments estimator.

\kwd
\doi

10.1214/09-AOAS313 \volume4 \issue2 2010 \firstpage567 \lastpage588

\runtitle

Social network dynamics

{aug}

a]\fnmsTom A. B. \snmSnijders\thanksreft1label=e1]tom.snijders@nuffield.ox.ac.uk\corref, a]\fnmsJohan \snmKoskinen\thanksreft1label=e2]johan.koskinen@nuffield.ox.ac.uk
and b]\fnmsMichael \snmSchweinberger\thanksreft2label=e3]michael.schweinberger@stat.psu.edu \thankstextt1Supported in part by the US National Institutes of Health (NIH 1R01HD052887-01A2). \thankstextt2Supported in part by the Netherlands Organization for Scientific Research (NWO 401-01-552 and 446-06-029).

Graphs \kwdlongitudinal data \kwdmethod of moments \kwdstochastic approximation \kwdRobbins–Monro algorithm.

## 1 Introduction

Relations between social actors can be studied by methods of social network analysis [e.g., Wasserman and Faust (1994); Carrington, Scott and Wasserman (2005)]. Examples are friendship between pupils in a school class or alliances between firms. A basic data structure for social networks is the directed graph or digraph, where the actors are represented by the nodes, and the arcs between the nodes indicate the social ties. Social network analysis traditionally has had a focus on rich description of network data, but the recent development of methods of statistical inference for network data [e.g., Airoldi et al. (2007); Hunter and Handcock (2006)] has the potential of moving this field toward a wider use of inferential approaches. Longitudinal studies are especially important for obtaining insight into social networks, but statistical methods for longitudinal network data that are versatile enough for realistic modeling are only just beginning to be developed.

This article considers repeated observations of a relation, or network, on a given set of actors , observed according to a panel design, and represented as a sequence of digraphs for , where are the observation moments. The nodes represent social actors (which may be individuals, companies, etc.), and the node set is the same for all observation moments. A digraph is defined here as an irreflexive relation, that is, a subset of , and when we shall say that there is an arc, or a tie, from to . It often is realistic to assume that the social network has been developing between the observation moments, which leads to the assumption that the observations are realizations of stochastic digraphs embedded in a continuous-time stochastic process , . Holland and Leinhardt (1977) proposed to use continuous-time Markov chains, defined on the space of all digraphs with a given node set, for modeling social network dynamics even if the observations are made at a few discrete time points and not continuously. Continuous-time Markov chains provide a natural starting point for modeling longitudinal network data. Wasserman (1979, 1980) and Leenders (1995) elaborated dyad-independent models, where the ties between pairs of actors (dyads) develop according to processes that are mutually independent between dyads. This is not realistic for social processes, because dependence between the set of ties among three or more actors can be very strong, as was found already by Davis (1970) who showed that a basic feature of many social networks is the tendency toward transitivity (“friends of my friends are my friends”). Snijders and van Duijn (1997) and Snijders (2001) proposed so-called actor-oriented models, explained in the next section, which do allow such higher-order dependencies.

The actor-oriented models are too complicated for the calculation of likelihoods or estimators in closed form, but they represent stochastic processes which can be easily simulated. This was exploited by the estimation method proposed in the papers mentioned, which is a Method of Moments (MoM) estimator implemented algorithmically by stochastic approximation [Robbins and Monro (1951); also see, e.g., Kushner and Yin (2003)]. This estimator usually performs well [some empirical applications are given in de Federico (2003), van de Bunt, van Duijn and Snijders (1999), and van Duijn et al. (2003)].

It is to be expected, however, that the statistical efficiency of the Maximum Likelihood (ML) estimator will be greater. ML estimation also paves the way for likelihood-based model selection, which will be a marked improvement on existing methods of model selection. This article presents an MCMC algorithm for approximating the ML estimator, combining and extending ideas in Gu and Kong (1998), Snijders (2001), and Koskinen and Snijders (2007), the latter of which proposed for this model a MCMC algorithm for Bayesian inference. Section 2 presents the model definition. The algorithm for obtaining the ML estimator is described in Section 3. Section 4 reports results of an empirical example and Section 5 presents a very small simulation study comparing the ML and MoM estimators. The paper finishes with an algorithm for approximating the likelihood ratio test in Section 6 and a discussion in Section 7.

## 2 Model definition

We assume that repeated observations on the network are available for some . The network, or digraph, will be identified with its adjacency matrix, of which the elements denote whether there is a tie from node to node () or not (). Self-ties are not allowed, so that the diagonal is structurally zero. Random variables are denoted by capitals. The stochastic process , in which the observations are embedded, is modeled as being right-continuous.

Various models have been proposed, most of them being Markov processes of some kind. We focus on actor-oriented models [Snijders (2001)]. Tie-oriented models [Snijders (2006)] can be treated similarly. The basic idea of actor-oriented models [Snijders (2001)] is that the nodes of the graph represent social actors, who have control, albeit under constraints, of their outgoing ties; and the graph develops as a continuous-time Markov process (even though it is observed only at discrete time points). The constraints are that ties may change only one by one, and actors do not coordinate their changes of ties. Thus, at any given moment, one actor may create one new tie or delete one existing tie, where the probability distribution of such changes depends on the current digraph; excluded are simultaneous changes such as swapping one tie for another, or bargaining between actors over ties. This constraint was proposed already by Holland and Leinhardt (1977), and it has the virtue of splitting up the change process in its smallest possible constituents.

The actor-oriented model is further specified as follows; further discussion and motivation is given in Snijders (2001).

1. Opportunities for change

Each actor gets at stochastically determined moments the opportunity to change one of the outgoing tie variables (, ). Since the process is assumed to be Markovian, waiting times between opportunities have exponential distributions. Each of the actors has a rate function which defines how quickly this actor gets an opportunity to change a tie variable, when the current value of the digraph is , and where is a parameter. At any time point with , the waiting time until the next opportunity for change by any actor is exponentially distributed with parameter

 λ(α,x)=∑iλi(α,x). (1)

Given that an opportunity for change occurs, the probability that it is actor who gets the opportunity is given by

 πi(α,x)=λi(α,x)/λ(α,x). (2)

The rate functions can be constant between observation moments, or they can depend on functions which may be covariates or positional characteristics of the actors such as outdegrees . A convenient assumption is to use an exponential link function,

 λi(α,x)=exp(∑kαkrik(x)). (3)
1. Options for change

When actor gets the opportunity to make a change, this actor has a permitted set of values to which the digraph may be changed, where is the current value of the digraph. The assumption that the actor controls his or her outgoing ties, but can change only one tie variable at the time, is equivalent to {subequation}

 Ai(x0)⊂{x0}∪Ari(x0), (4)

where is the set of adjacency matrices differing from in exactly one element,

 Ari(x0) = {x∣xij=1−x0ij for one j≠i, and xhk=x0hk for all other (h,k)}.

Including in can be important for expressing the property that actors who are satisfied with the current network will prefer to keep it unchanged. Therefore, the usual model is . This does not lead to identifiability problems because the ratio between not making a change and making a change is not a free parameter, but fixed by assumption (6) for the conditional probabilities of the new state of the network.

Some alternatives are models with structurally impossible ties, where the impossible digraphs are left out of , and models where the actor is required to make a change whenever there is the opportunity, obtained by leaving the current element out of .

It is assumed that the network dynamics is driven by a so-called objective function that can be interpreted as the relative attractiveness for actor of moving from the network represented by to the network , and where is a parameter. Under the condition that the current digraph is and actor gets the opportunity to make a change, the conditional probability that the next digraph is  is modeled as

where the summation extends over . This formula can be motivated by a random utility argument as used in econometrics [see, e.g., Maddala (1983)], where it is assumed that the actor maximizes plus a random disturbance having a standard Gumbel distribution. Assumption (2) implies that instead of we can also write where the correspondence between and is defined as follows: if , is the unique element of for which ; if , . This less redundant notation will be used in the sequel. Thus, for , is the probability that, under the condition that actor  has the opportunity to make a change and the current digraph is , the change will be to with the rest unchanged, while is the probability that, under the same condition, the digraph will not be changed.

The most usual models are based on objective functions that depend on only. This has the interpretation that actors wish to maximize a function independently of “where they come from.” The greater generality of (6), where the objective function can depend also on the previous state , makes it possible to model path-dependencies, or hysteresis, where the loss suffered from withdrawing a given tie differs from the gain from creating this tie, even if the rest of the network has remained unchanged.

Various ingredients for specifying the objective function were proposed in Snijders (2001). A linear form is convenient,

 fi(β,x0,x)=L∑k=1βksik(x0,x), (7)

where the functions are determined by subject-matter knowledge and available social scientific theory. These functions can represent essential aspects of the network structure, assessed from the point of view of actor , such as

 sik(x0,x) = ∑jxij(outdegree) (12) ∑jxijxji(reciprocated ties) ∑j,kxijxjkxik(transitive % triplets) ∑j(1−xij)maxkxikxkj(indirect ties) ∑jx0ijx0jixijxji(% persistent reciprocity);

they can also depend on covariates—such as resources and preferences of the actors, or costs of exchange between pairs of actors—or combinations of network structure and covariates. For example, de Federico (2003) found in a study of friendship between foreign exchange students that friendship formation tends to be reciprocal, with a negative parameter for forming indirect ties (thus leading to relatively closed networks), and that friendships are more likely to be formed between individuals from the same region (a covariate effect), but that reciprocation adds less to the tendency to form ties between persons from the same region than between arbitrary individuals (negative interaction between covariate and reciprocity).

1. Intensity matrix; time-homogeneity

The model description given above defines as a continuous-time Markov process with -matrix or intensity matrix [e.g., Norris (1997)] for given by

The assumptions do not imply that the distribution of is stationary. The intensity matrix is time-homogeneous, however, except for time dependence reflected by time-varying components in the functions .

For the data-collection designs to which this paper is devoted, where observations are done at discrete time points , these time points can be used for marking time-heterogeneity of the transition distribution (cf. the example in Section 6). For example, covariates may be included with values allowed to change at the observation moments. A special role is played here by the time durations between successive observations. Standard theory for continuous-time Markov chains [e.g., Norris (1997)] shows that the matrix of transition probabilities from to is , where is the matrix with elements (13). Thus, changing the duration can be compensated by multiplication of the rate function by a constant. Since the connection between an externally defined real-valued time variable and the rapidity of network change is tenuous at best, it usually is advisable [e.g., see Snijders (2001)] to include a multiplicative parameter in the rate function which is constant between observation moments but differs freely between periods . With the inclusion of such a parameter, the numerical values become unimportant because changes in their values can be compensated by the multiplicative rate parameters.

### 2.1 Comparison with discrete-time Markov chain models

Popular models for analyzing cross-sectional network data are exponential families of graph distributions such as the Markov model of Frank and Strauss (1986), generalized to the Exponential Random Graph (ERG) model by Frank (1991) and Wasserman and Pattison (1996), and elaborated and further specified by Hunter and Handcock (2006) and Snijders et al. (2006). Discrete-time Markov chain models, as longitudinal models of this kind, were proposed by Robins and Pattison (2001), Krackhardt and Handcock (2007), and Hanneke and Xing (2007). There are a number of essential differences between these models and the actor-oriented model treated here, with respect to interpretation as well as statistical procedures.

When applying the actor-oriented model to a sequence of two or more repeated observations, these observations are embedded in a continuous-time model. This has clear face validity in applications where changes in the network take place at arbitrary moments between observations. Singer (2008) gives an overview of the use of this principle for continuously distributed data. This approach yields a major advantage over the cited longitudinal ERG models: our probability model is defined independently of the observational design. Data from irregularly spaced observation intervals can be analyzed without the need to make any adaptations. Another advantage is that the dynamic process is defined parsimoniously as a function of its elementary constituents—in this case, the conditional probability of a single tie change. The random utility interpretation of (6), discussed above, gives an interpretation in terms of myopic optimization of an objective function to which a random disturbance is added, and which can be used to represent a “social mechanism” that could have given rise to the observed dynamics.

The discrete-time ERG models constitute exponential families, which has the advantage that many standard techniques can be directly applied. The distribution of the continuous-time process for the actor-oriented model constitutes an exponential family, so for discrete observation moments our model can be regarded as an incompletely observed exponential family. For both types of model, inference is computer-intensive and time-consuming because of the need to implement elaborate MCMC procedures. The definition of the model given above can be used directly to simulate data from the probability distribution, conditional on an initial state of the network. This contrasts with models in the ERG family, which can be simulated only indirectly, by regarding them as the stationary distribution of an auxiliary Markov chain, and applying a Gibbs or Metropolis–Hastings algorithm. The near degeneracy problem [Snijders et al. (2006)] which plagues some specifications of ERG models is, in practice, not a problem for the actor-oriented model.

## 3 ML estimation

This section presents an algorithm for MCMC approximation of the maximum likelihood estimate. Estimation is done conditional on the first observation . This has the advantage that no model assumptions need to be invoked concerning the probability distribution that may have led to the first observed network , and the estimated parameters refer exclusively to the dynamics of the network.

The algorithm proposed below can be sketched as follows. For each () the observed data are augmented with random draws from the sequence of intermediate digraphs that could have led from one observation, , to the next, (Section 3.1). These draws can be simulated using a Metropolis–Hastings algorithm (Section 3.3). They are then used in the updating steps of a Robbins–Monro algorithm, following ideas of Gu and Kong (1998), to find the solution of the likelihood equation (Section 3.2). These elements are put together in Section 3.4.

### 3.1 Augmented data

The likelihood for the observed data conditional on cannot generally be expressed in a computable form. Therefore, the observed data will be augmented with data such that an easily computable likelihood is obtained, employing the general data augmentation principle proposed by Tanner and Wong (1987). The data augmentation can be done for each period separately and, therefore, this section considers only the observations and .

Denote the time points of an opportunity for change by and their total number between and by , the time points being ordered increasingly so that . The model assumptions imply that at each time , there is one actor, denoted , who gets an opportunity for change at this time moment. Define as the actor toward whom the tie variable is changed, and define formally if there is no change [i.e., if )]:

 (Ir,Jr) is the only (i,j) for which xij(Tr−1)≠xij(Tr) if there is such an (i,j); else Jr=Ir.

Given , the outcome of the stochastic process , completely determines .

The augmenting data consist of and , , without the time points . It may be noted that this differs from the definition of a sample path in Koskinen and Snijders (2007) in that the times in between opportunities for change are integrated out; the reason is to obtain a simpler MCMC algorithm. The possible outcomes of the augmenting data are determined by the condition

for all with . The stochastic process , will be referred to as the sample path; the elements for which , although redundant to calculate from , are retained because they facilitate the computation of the likelihood. Define ; the digraphs and differ in element provided , and in no other elements.

The probability function of the sample path, conditional on , is given by

 psp{V=((i1,j1),…,(iR,jR));α,β} =Pα,β{TR≤t2

where is defined in (2) and in and just after (6). Denote the first component of (3.1) by

 κ(α,x(0),(i1,j1),…,(iR,jR)) (16) =Pα,β{TR≤t2

Conditioning on [and not on !], the differences are independently exponentially distributed with parameters. Hence, under this conditioning the distribution of is the convolution of exponential distributions with parameters for . In the special case that the actor-level rates of change are constant, denoted by ,  has a Poisson distribution with parameter ; (3.1) then is given by

 κ(α,x(0),(i1,j1),…,(iR,jR))=exp(−nα1(t2−t1))(nα1(t2−t1))RR!. (17)

In the general case where the change rates are nonconstant, the probability (3.1) can be approximated as follows. Denote by the density function of , conditional on , or, equivalently, on the embedded process The distribution of is a convolution of exponential distributions and, therefore, the central limit theorem implies that the density function is approximately the normal density with expected value

 μα=R−1∑r=0{λ(α,x(r))}−1 (18)

and variance

 σ2α=R−1∑r=0{λ(α,x(r))}−2. (19)

Hence,

 pTR(t)≈1√(2πσ2α)exp(−(t−t1−μα)22σ2α). (20)

Probability (3.1) now can be expressed as

 κ(α,x(0),(i1,j1),…,(iR,jR)) =∫t2t1pTR(s)P{TR+1−TR>t2−TR∣TR=s}ds (21) ≈pTR(t2)∫t2t1exp(−λ(α,x(R))(t2−s))ds ≈pTR(t2)λ(α,x(R)).

The approximations are valid for large and , under boundedness conditions on the rate functions . The first approximation in (3.1) is based on splitting the integration interval into two intervals and for a bounded but large ; the first interval then gives an asymptotically negligible contribution and on the second interval is approximately constant since . The second approximation uses that is large. Combining the preceding equations yields

 κ(α,x(0),(i1,j1),…,(iR,jR)) (22) ≈1λ(α,x(R))√(2πσ2α)exp(−(t2−t1−μα)22σ2α).

This shows that, for observed data augmented by the sample path, the likelihood conditional on can be expressed directly, either exactly or in a good approximation.

### 3.2 Missing data principle and stochastic approximation

An MCMC algorithm will be used that finds the ML estimator based on augmented data. Several methods for MCMC maximum likelihood estimation have been proposed in the literature. We shall use the Markov Chain Stochastic Approximation (MCSA) algorithm proposed by Gu and Kong (1998) and used (in a slightly different specification) also by Gu and Zhu (2001). This algorithm is based on the missing information principle of Orchard and Woodbury (1972) and Louis (1982)—going back to Fisher (1925); cf. Efron (1977). The principle can be summarized as follows. Suppose that is observed, with probability distribution parameterized by and having probability density w.r.t. some -finite measure. To facilitate estimation, the observed data is augmented by extra data (regarded as missing data) such that the joint density is . Denote the observed data score function by and the total data score function by . It is not hard to prove (see the cited literature) that

 Eθ{SXV(θ;x,V)∣X=x}=SX(θ;x). (23)

This is the first part of the missing information principle. This equation implies that the likelihood equation can be expressed as

 Eθ{SXV(θ;x,V)∣X=x}=0, (24)

and, therefore, ML estimates can be determined, under regularity conditions, as the solution of (24).

This is applied in situations where the observed data score function is too difficult to calculate, while the total data score function is computable. In our case, we condition on , so this is treated as being fixed; we observe ; and between each pair of consecutive observations and the data are augmented, as discussed in the previous section, by the sample path that could have brought the network from  to . These sample paths combined for to constitute . The following two subsections supply the additional elements for how the augmenting data are used.

In the MCSA algorithm of Gu and Kong (1998), the solution of (24) is obtained by stochastic approximation [Robbins and Monro (1951); Kushner and Yin (2003)] which is defined by the updating step

 ^θ(N+1)=^θ(N)+aND−1SXV(^θ(N);x,V(N)), (25)

where is generated according to the conditional distribution of , given , with parameter value . The sequence , called the gain sequence, consists of positive numbers, tending to 0. The matrix is a suitable matrix. It is efficient [see Kushner and Yin (2003)] to use a gain sequence tending to zero as for a , and to estimate not by the last element produced by the algorithm, but by a tail average . For fixed and , this average converges to the solution of (24) for a wide range of positive definite matrices .

The second part of the missing information principle [Orchard and Woodbury (1972)] is that the observed Fisher information matrix for the observed data can be expressed as

 DX(θ) = −∂SX(θ;x)/∂θ = Eθ{DXV(θ)∣X=x}−Covθ{SXV(θ;x,V)∣X=x},

where is the complete data observed Fisher information matrix,

 DXV(θ)=−∂SXV(θ;x,V)/∂θ. (27)

Expression (3.2) can be interpreted loosely as “information is total (but partially unobserved) information minus missing information.” This formula allows us to calculate standard errors.

### 3.3 Simulating the sample path

The MCSA method relies on Monte Carlo simulations of the missing data . The Markov property allows us to treat all periods separately and, therefore, we simplify the treatment and notation here by treating only one of those periods, temporarily assuming that . The missing data then is the sample path which specifies the sequence of tie changes that brings the network from to . The advantage of having integrated out the time steps —with the use of the expressions (17) and (3.1)—is that less random noise is introduced, and the simulated variable  is discrete rather than including a Euclidean vector with a varying dimension , which would require a more complicated MCMC procedure.

The set of all sample paths connecting and is the set of all finite sequences of pairs , , where for the parity of the number of occurrences of is given by (14). Such sequences will be denoted by , and the set of all these sequences is denoted . The probability of the sample path conditional on is proportional to (3.1), rewritten here as

 (28)

where is given by (17) or approximated by (3.1), depending on the specification of . Draws from this distribution can be generated by the Metropolis–Hastings algorithm, provided that a proposal distribution is used which connects any two elements of .

Such a proposal distribution will now be given, denoting the proposal probabilities by and the target probabilities, which are proportional to (28), by . Then the acceptance probabilities in the Metropolis–Hastings algorithm, for a current state and a proposed state , are

 (29)

A proposal distribution is used that consists of small changes in . The construction of the proposal distribution was based on the considerations that the proposal distribution should mix well in the set of all sample paths, and the Metropolis–Hastings ratios in (29) should be computable relatively easily. This led to proposal distributions consisting of the following types of small changes in :

1. Paired deletions. Of all pairs of indices with ,, one pair is randomly selected, and and are deleted from .

2. Paired insertions. A pair with is randomly chosen, two indices are randomly chosen, and the element is inserted immediately before and before .

3. Single insertions. At a random place in the path (allowing beginning and end), the element is inserted for a random .

4. Single deletions. Of all elements satisfying , a randomly chosen one is deleted.

5. Permutations. For randomly chosen , where is bounded from above by some convenient number to avoid too lengthy calculations, the segment of consecutive elements is randomly permuted.

It is evident that these five operations yield new sequences within the permitted set [cf. (14)]. Paired insertions and paired deletions are each others’ inverse operations, and the same holds for single insertions and single deletions. These four types of operation together are sufficient for all elements of to communicate. Permutations are added to achieve better mixing properties.

The detailed specification of how these elements are combined in the proposal distribution can be obtained from the authors. Two considerations guided this specification. In the first place, transparency of the algorithm. Paired insertions and paired deletions are not always unique inverses of each other. Nonuniqueness of the inverse operation would lead to complicated counting procedures to determine proposal probabilities. Therefore, the restriction is made that pairs of elements can only be inserted at, or deleted from a pair of positions in the sequence, if there are no occurrences of the same between these positions; and only if at least one other (with or ) occurs in between these positions. In this restricted set of operations, paired insertions and paired deletions are each other’s unique inverses, which simplifies proposal probabilities. Due to the presence of permutations, all elements of the space still are reachable from any element.

The second consideration is computational efficiency. When proposing that a pair of elements is inserted or deleted at certain positions, then also proposing to permute a segment between those positions entails no increase in computational load for calculating the Metropolis–Hastings ratios, and this permutation will lead to larger changes in , and thereby, hopefully, better mixing for a given number of computations.

### 3.4 Putting it together

This subsection combines the elements presented in the preceding subsections to define an algorithm for MCMC approximation of the ML estimate. It is now assumed that an arbitrary number of repeated observations has been made: . As argued before, we condition on the first observation . The further data is denoted by . The parameter is denoted by .

The Markov property entails that the observed data score function, conditional on , can be decomposed as

 SX∣X(t1)(θ;x(t2),…,x(tM)|x(t1)) (30) =M∑m=2SX(tm)|X(tm−1)(θ;x(tm)|x(tm−1)),

where is the score function based on the conditional distribution of , given .

For the period from to , the data set is augmented by which defines the order in which ties are changed between these time points, as described in Section 3.1; this can be denoted by

 Vm=((Im1,Jm1),…,(ImRm,JmRm))

with outcome . The augmenting variable as used in Section 3.1 is . Denote the probability (3.1) for the period from to by , and the corresponding total data score function by

 Sm(θ;x(tm−1),vm)=∂logpm(vm;θ|x(tm−1))∂θ. (31)

From (23) and (3.4), and using the Markov property, it can be concluded that the observed data score function now can be written as

 SX∣X(t1)(θ;x|x(t1)) (32) =M∑m=2Eθ{Sm(θ;x(tm−1),Vm)|X(tm−1)=x(tm−1),X(tm)=x(tm)}.

The ML estimate is obtained as the value of for which (3.4) equals 0 [cf. (24)].

The algorithm is iterative, and the th updating step now can be represented as follows:

1. For each , make a large number of the Metropolis–Hastings steps as described in Section 3.3, yielding .

2. Compute

using (31) with defined by (3.1) for the period from to , and using (3.1) and (17) or (3.1).

3. Update the provisional parameter estimate by

 ^θ(N+1)=^θ(N)+aND−1SXV(^θ(N);x,v(N)).

As mentioned in Section 3.2, the estimate is calculated as a tail average of the values generated by this algorithm. The covariance matrix of the ML estimator is estimated using (3.2), where the expected values in the right-hand side are approximated by Monte Carlo simulation of the conditional distribution of  given , for . This involves the matrix of partial derivatives (27), which can be estimated by a score-function method as elaborated in Schweinberger and Snijders (2006).

The main implementation details are the following:

1. The Method of Moments (MoM) estimator [Snijders (2001)], in practice, yields a good initial value .

2. To avoid long burn-in times for step (1.), the Metropolis–Hastings algorithm for generating can be started from the previous value, , rather than independently. This leads to some autocorrelation in the updates defined in step (3.), and the number of Metropolis–Hastings steps must be large enough that this autocorrelation is not too high. It may be noted that the Robbins Monro algorithm is robust to moderate dependence between successive updates [Kushner and Yin (2003)]. We have found good results when the number of steps is tuned so that the autocorrelations between the elements of are less than 0.3.

3. For in step (3.) we use a Monte Carlo estimate of the complete data observed Fisher information matrix (27), estimated for before making the iteration steps.

4. For the other numerical parameters of the algorithm we use the same values as described, for the stochastic approximation algorithm to compute the MoM estimator, in Snijders (2001).

This algorithm is implemented in Siena version 3.3 [Snijders et al. (2009)]. The executable program as well as the code can be found athttp://www.stats.ox.ac.uk/siena/.

## 4 Empirical example

As an illustrative example, the data set is used that was also analyzed in van de Bunt, van Duijn and Snijders (1999) and in Snijders (2001). This is a friendship network between 32 freshman students in a given discipline at a Dutch university. Initially most of the students were unknown to each other. There were six waves denoted of data collection, with 3 weeks between waves for the start of the academic year, and 6 weeks in between later. The relation studied is “being friends or close friends.” The data set is obtainable from website http://www.stats.ox.ac.uk/ siena/.

The transitions between observations to , and to , will here be studied separately. To identify the rate function, we assume (arbitrarily but conveniently) that the duration of the time periods is unity, . To keep the model specification relatively simple, the rate function is supposed to be constant across actors, given by from to and by from to ; and the objective function (7) is chosen independent of the previous state , containing contributions of the effects of outdegree, number of reciprocated ties, number of transitive triplets, number of 3-cycles, gender of the sender of the tie (“ego”), gender of the receiver of the tie (“alter”), and gender similarity:

 fi(β,x) = β1∑jxij+β2∑jxijxji +β3∑j,kxijxjkxik+β4∑j,kxijxjkxki+β5∑jxij(zj−¯z) +β6∑jxij(zi−¯z)+β7∑jxij{1−|zi−zj|},

where variable indicates the gender of actor (, ) and its average over the 32 individuals. This model illustrates two types of triadic dependence (transitive triplets and 3-cycles) and the use of covariates (gender).

For this model the parameters are estimated for the two transitions and separately both by the Method of Moments (MoM) and by Maximum Likelihood. For models where the functions used in (7) depend only on and not on , so that they can be expressed as , the MoM estimator as defined by Snijders (2001) is based on the vector with components for and for . The MoM estimator is defined as the parameter vector for which the observed and expected values of this statistic are equal, and can be determined by stochastic approximation.

Both the moment equation and the likelihood equation can be represented as , where is the difference between observed and estimated moments or the complete-data score function, respectively. For both estimators, after running the stochastic approximation algorithm, convergence was checked by simulating the model for the obtained parameters for 2000 runs and calculating, for each component of , the ratio of the average simulated to its standard deviation. In all cases, this ratio was less than 0.1, indicating adequate convergence.

Parameter estimates and standard errors are reported in Table 1. Assuming that the estimators are approximately normally distributed (which is supported by the simulations reported in the next section, although we have no proof), the significance can be tested by referring the ratios of estimate to standard error to a standard normal distribution. The Method of Moments and Maximum Likelihood estimates are different but lead to the same substantive conclusions. The parameters reflecting network structure, to , give the same picture for both transitions. The negative indicates that an outgoing tie which is not reciprocated and not transitively embedded is not considered attractive; the positive and indicate that there is evidence for tendencies toward reciprocation and transitivity of friendship choices, when controlling for all other effects in the model. For interpreting the 3-cycles effect, note that closed 3-cycles are structures denying a hierarchically ordered relation. The negative indicates a tendency away from closed 3-cycles, which—in conjunction with the positive transitive triplets parameter—could be interpreted as a tendency toward a local (i.e., triadic) popularity hierarchy in friendship. For gender, there is only a close to significant effect of gender alter for the transition, suggesting that male students tend to be more popular as friends, when controlling for the other effects.

From this single data example, of course no conclusions can be drawn concerning the relative value of these two estimation methods.

## 5 Simulation examples

A comparison between the finite sample behavior of the MoM and the ML estimators can be based on simulations. Here we present a small simulation study as a very limited exploration of the relative efficiency of the two estimators, which is expected to be in favor of the ML estimator. The limited nature of this simulation study does not allow generalization, but the study design is meant to be typical for applications to friendship networks in rather small groups, and replicates approximately the empirical study of the previous section.

The model is identical to that of the previous section: there are three repeated observations, 32 actors, one binary covariate called gender, and the first observed network as well as the distribution of the covariate are identical to the van de Bunt data set at observation . Therefore, the observation moments are again referred to as . The parameter values are rounded figures close to the estimates obtained in the preceding section. Networks for times and are generated and parameters estimated under the assumption that parameters are the same in periods and . The simulation model has parameters and , , and . A total of 1000 data sets were generated and the estimates calculated by both methods. Table 2 reports the average estimates, the root mean squared errors, the rejection rates for testing the data-generating value of the parameter as the null hypothesis (estimating type-I error rates), and the rejection rates for testing that the parameters equal 0 (estimating power), where the tests were two-sided tests based on the -ratio for the corresponding parameter estimate, assuming a standard normal reference distribution, at a nominal significance level of 5%.

Out of the 1000 generated data sets, 5 were excluded because they did not produce well converging results in the default specification of the algorithm for one or both estimators. Table 2 shows that the results for the two estimators are very similar, and the type-I error rates are close to the nominal value, except for the inflated type-I rates of the ML estimators for the two rate parameters. The latter is related to the skewed distribution of the rate parameter estimators. The correlations between the estimators are more than 0.93 for all coordinates; note that correlations are attenuated due to the stochastic nature of the algorithms. It can be concluded that for this type of model, characterized by 32 actors and 3 waves with rates of change 2.5 and 3.5, with 7 parameters in the objective function, MoM and ML estimation yield quite similar results.

To explore data sets with less information, a similar simulation study was conducted with 20 actors, where the first network was induced by the network of 20 of the actors in the van de Bunt data set, and the rest of the simulation design differed from the previous study by including an interaction effect of reciprocity by gender similarity, represented by the term

 β8∑jxijxji{1−|zi−zj|},

with parameter . This effect is included to achieve a higher correlation of the parameter estimators, which together with the smaller number of actors should lead to greater difficulties in estimation.

The results are shown in Table 3. Of the 1000 generated data sets, 20 were excluded from the results because of questionable convergence of the algorithm. The table shows that the ML estimator here performs clearly better than the MoM estimator. The estimated relative efficiency of MoM compared to ML expressed as ratio of mean squared errors ranges for these 10 parameters from 0.50 to 0.99. The correlation between the estimators ranges from 0.71 (gender ego) to 0.98 (rate period 1). For all parameters except , the estimated power of the test based on the ML estimator is higher than that of the MoM-based test, which can be traced back to the combination of a smaller mean squared error and a less conservative test. The surprisingly low power of the MoM-based test for the gender-ego effect reflects high standard errors that tend to be obtained for estimating .

## 6 Likelihood ratio tests

One of the important advantages of a likelihood approach is the possibility of elaborating model selection procedures. Here we only explain how to conduct a likelihood ratio test.

A convenient way to estimate the likelihood ratio is by a simple implementation of the idea of the path sampling method described by Gelman and Meng (1998), who took this method from the statistical physics literature where it is known as thermodynamical integration. For arbitrary and , defining the function , with , this method is based on the equation

 log(pX(x;θ1)pX(x;θ0))=∫10˙θSX(x;θ(t))dt. (33)

This integral can be approximated by replacing the integral by a finite sum and using (23). Applying a simulation procedure which for each , , generates draws from the conditional distribution of given under parameter , the log likelihood ratio (33) can be approximated by

 (θ1−θ0)L(H+1)L∑ℓ=1H∑h=0SXV(θ(h/H);x,Vhℓ). (34)

Burn-in time can be considerably reduced when starting the MCMC algorithm for generating by the end result of the algorithm used to generate .

This was applied to the van de Bunt data set used in Section 4 to test the null hypothesis that all parameters