On latent position inference from doubly stochastic messaging activities
Abstract
We model messaging activities as a hierarchical doubly stochastic point process with three main levels, and develop an iterative algorithm for inferring actors’ relative latent positions from a stream of messaging activity data. Each of the messageexchanging actors is modeled as a process in a latent space. The actors’ latent positions are assumed to be influenced by the distribution of a much larger population over the latent space. Each actor’s movement in the latent space is modeled as being governed by two parameters that we call confidence and visibility, in addition to dependence on the population distribution. The messaging frequency between a pair of actors is assumed to be inversely proportional to the distance between their latent positions. Our inference algorithm is based on a projection approach to an online filtering problem. The algorithm associates each actor with a probability densityvalued process, and each probability density is assumed to be a mixture of basis functions. For efficient numerical experiments, we further develop our algorithm for the case where the basis functions are obtained by translating and scaling a standard Gaussian density.
Key words. Social network; Multiple doubly stochastic processes; Classification; Clustering
AMS subject classifications. 62M0, 60G35, 60G55
1 Introduction
Communication networks are presenting everincreasing challenges in a wide range of applications, and there is great interest in inferential methods for exploiting the information they contain. A common source of such data is a corpus of timestamped messages such as emails or SMS (short message service). Such messaging data is often useful for inferring a social structure of the community that generates the data. In particular, messaging data is an asset to anyone who would like to cluster actors according to their similarity. A practitioner is often privy to messaging data in a streaming fashion, where the word streaming describes a practical limitation, as the practitioner might be privy only to the incoming data in a fixed summarized form without any possibility to retrieve past information. It is in the practitioner’s interest to transform the summarized data so that the transformed data is appropriate for detecting emerging social trends in the source community.
We mathematically model such streaming data as a collection of tuples of the form of time and actors, where and represent actors exchanging the th message and represents the occurrence time of the th message. There are many models suitable for dealing with such data. The most notable are the Cox hazard model, the doubly stochastic process (also known as the Cox process), and the selfexciting process (although selfexciting processes are sometimes considered as special cases of the Cox hazard model). For references on these topics, see Andersen et al. (1995), Snyder (1975) and Bremaud (1981). All three models are related to each other; however, the distinctions are crucial to statistical inference as they stem from different assumptions on information available for (online) inference. To transform data to a data representation more suitable for clustering actors, we model as a (multivariate) doubly stochastic process, and develop a method for embedding as a stochastic process taking values in for some suitably chosen .
2 Related works
For statistical inference when there is information available beyond , the Coxproportional hazard model is a natural choice. In Heard et al. (2010) and Perry and Wolfe (2013), for instance, instantaneous intensity of messaging activities between each pair of actors is assumed to be a function of, in the language of generalized linear model theory, known covariates with unknown regression parameters. More specifically, in Heard et al. (2010), the authors consider a model where with and representing independent counting processes, e.g., are Bernoulli random variables and are random variables from the exponential family. On the other hand, in Perry and Wolfe (2013), a Cox multiplicative model was considered where . The model in Perry and Wolfe (2013) posits that actor interacts with actor at a baseline rate modulated by the pair’s covariate whose value at time is known and is a common parameter for all pairs. In Perry and Wolfe (2013), it is shown under some mild conditions that one can estimate the global parameter consistently. In Stomakhin et al. (2011), the intensity is modeled for adversarial interaction between macro level groups, and a problem of nominating unknown participants in an event as a missing data problem is entertained using a selfexciting point process model. In particular, while no explicit intensity between a pair of actors (gang members) is modeled, the event intensity between a pair of groups (gangs) is modeled, and the spatiotemporal model’s chosen intensity process is selfexciting in the sense that each event can affect the intensity process.
When data is the only information at hand, a common approach is to construct a time series of (multi)graphs to model association among actors. For such an approach, a simple method to obtain a time series of graphs from is to “pairwise threshold” along a sequence of nonoverlapping time intervals. That is, given an interval, for each pair of actors and , an edge between vertex and vertex is formed if the number of messaging events between them during the interval exceeds a certain threshold. This is the approach taken in Cortes et al. (2003); Eckmann et al. (2004); Adamic and Adar (2005), Lee and Maggioni (2011) and Ranola et al. (2010), to mention just a few examples. The resulting graph representation is often thought to capture the structure of some underlying social dynamics. However, recent empirical research, e.g., Choudhury et al. (2010), has begun to challenge this approach by noting that changing the thresholding parameter can produce dramatically different graphs.
Another useful approach when is the only information available is to use a doubly stochastic process model in which count processes are driven by latent factor processes. This is the approach taken explicitly in Lee and Priebe (2011) and Tang et al. (2013), and this is also done implicitly in Chi and Kolda (2012). In Lee and Priebe (2011) and Tang et al. (2013) interactions between actors are specified by proximity in their latent positions; the closer two actors are to each other in their latent configuration, the more likely they exchange messages. Using our model, we consider a problem of clustering actors “online” by studying their messaging activities. This allows us a more geometric approach afforded by embedding data to an representation for some fixed dimension .
In this paper, we propose a useful mathematical formulation of the problem as a filtering problem based on both a multivariate point process observation and a population latent position distribution.
3 Notation
As a convention, we assume that a vector is a column vector if its dimension needs to be disambiguated. We denote by the filtration up to time that models the information generated by undirected communication activities between actors in the community, where “undirected” here means we do not know which actor is the sender and which is the receiver. We denote by the space of probability measures on . For a probability density function defined on , denotes the probability density function that is proportional to where the normalizing constant does not depend on . The set of all matrices over the reals is denoted by . For each matrix , we write . Given a vector , we write for its Euclidean norm. Let and . For each and , we write for the Hadamard product of and , i.e., denotes componentwise multiplication. Given vectors in , the Gram matrix of the ordered collection is the matrix such that its  entry is the inner product of and . Given a matrix , is the column vector whose th entry is the th diagonal element of . With a slight abuse of notation, given a vector , we will also denote by the diagonal matrix such that its th diagonal entry is . We always use for the number of actors under observation and for the dimension of the latent space. We denote by the fold product of . An element of will be written in bold face letters, e.g. . Similarly, bold faced letters will typically be used to denote objects associated with the actors collectively. An exception to this convention is the identity matrix which is denoted by , where the dimension is specified only if needed for clarification. Also, we write as the column matrix of ones. With a bit of abuse of notation, we also write for an indicator function, and when confusion is possible, we will make our meaning clear.
4 Hierarchical Modeling
Our actors under observation are assumed to be a subpopulation of a bigger population. That is, we observe actors that are sampled from a population for a longitudinal study. We are not privy to the actors’ latent features that determine the frequency of pairwise messaging activities, but we do observe messaging activities . A notional illustration of our approach thus far is summarized in Figure LABEL:fig:hierarchalmodeldiagram, Figure LABEL:fig:subpop_historgrams, and Figure LABEL:fig:KullbackLeibler_divergence__Simulation. In both Figure LABEL:fig:subpop_historgrams and Figure LABEL:fig:KullbackLeibler_divergence__Simulation, represents the (same) initial time when there was no cluster structure, and and represent the emerging and fully developed latent position clusters which represent the object of our inference task.
Population density process level
The messagegenerating actors are assumed to be members of a community, which we call the population. The aspect of the population that we model in this paper is its members’ distribution over a latent space in which the proximity between a pair of actors determines the likelihood of the pair exchanging messages. The population distribution is to be timevarying and a mixture of component distributions.
The latent space is assumed to be for some , and the population distribution at time is assumed to have a continuous density . To be more precise, we assume that the (sample) path is such that for each ,
where

is a smooth sample path of a stationary (potentially degenerate) diffusion process taking values in ,

is a probability density function on with convex support with its mean vector being the zero vector and its covariance matrix being a positive definite (symmetric) matrix,

is a smooth sample path of an valued (potentially nonstationary or degenerate) diffusion process,

is a smooth sample path of a stationary (potentially degenerate) diffusion process taking values in .
Note that it is implicitly assumed that , and additionally, we also assume that for each and , the th moment of the th coordinate of is finite, i.e., .
In this paper, we take and as exogenous modeling elements. However, for an example of a model with yet further hierarchical structure, one could take a cue from a continuous time version of the classic “cointegration” theory, e.g., see Comte (1999). The idea is that the location of the th center is nonstationary, but the interpoint distance between a combination of the centers is stationary. More specifically, one could further assume that there exist matrix , matrix and matrix such that

is the dimensional zero matrix,

is a dimensional Brownian motion,

is a stationary diffusion process.
Thus, the position of centers are unpredictable, but the relative distance between each pair of centers are as predictable as that of a stationary process.
Actor position process level
Figure LABEL:fig:subpop_historgrams sketches the connection between actors and populations. We first define a process for a single actor. To begin, for each , given and , we write
where
The formulation here for the and is based on a quadratic Taylorseries approximation of a socalled “bounded confidence” model studied in GomezSerrano et al. (2012). Here, the value of represents the confidence level of actor on its current position and represents the visibility of other actors’ position by actor . Roughly speaking, an actor with a small value of and a large value of will be influenced greatly by actors that are positioned both near and far in the latent space whereas an actor with a large value of and a small value of will be influenced only a small amount by actors that are nearby in the latent space. For further discussion on our motivation for the form of , see Appendix LABEL:appendix:motivationforA.
For each actor , the deterministic path is assumed to be continuous, taking values in a compact subset of . It is assumed that given , each actor’s latent position process is a diffusion process whose generator is , and moreover we assume that are mutually independent. For each , let
where each is assumed to be a column vector, i.e., a matrix. In other words, the th row of is the transpose of .
Messaging process level
Denote by the number of messages sent from actor to actor . Also, denote by the number of messages exchanged between actor and actor . Note that . For each actor , we assume that the path is deterministic, continuous and takes values in . For each , we assume that
For our algorithm development and Experiment in Section LABEL:sec:NumericalExperiments, we take
\hb@xt@.01(4.1) 
but for Experiment in Section LABEL:sec:NumericalExperiments, we take . Next, by way of assumption, for each pair, say, actor and actor , we eliminate the possibility that both actor and actor send messages concurrently to each other. More specifically, we assume that
\hb@xt@.01(4.2)  
\hb@xt@.01(4.3) 
For future reference, we let
\hb@xt@.01(4.4)  
\hb@xt@.01(4.5) 
5 Algorithm for computing posterior processes
We denote by the conditional distribution of given , i.e., for each ,
\hb@xt@.01(5.1) 
For the rest of this paper, we shall assume that the (random) measure is absolutely continuous with respect to Lebesgue measure with its density denoted by . That is, for . Denote by the th marginal posterior distribution of , i.e., for each , , and let denote its density.
5.1 Theoretical background
In Theorem LABEL:thm:exactposterior, the exact formula for updating the posterior is presented, and in Theorem LABEL:thm:simplifyingposteriorupdaterule, our working formula used in our numerical experiments is given. We develop our theory for the case where and are the same for all actors for simplicity, as generalization to the case of each actor having different values for and is straightforward but requires some additional notational complexity.
Theorem 5.1
For each and ,
where is an matrix such that for each , and for each , , and is an matrix such that for each pair , and for each pair , .
Hereafter, for developing algorithms further for efficient computations, we make the assumption that for each ,
\hb@xt@.01(5.2) 
where denotes the joint density for actors , and .
Theorem 5.2
For each function , we have
Replacing with a Dirac delta generalized function, Theorem LABEL:thm:simplifyingposteriorupdaterule states that for each ,
where denotes the formal adjoint operator of . For use only within Algorithm LABEL:algo:EstimateActorPosterior,
\hb@xt@.01(5.3)  
\hb@xt@.01(5.4) 
5.2 A mixture projection approach
The projection filter is an algorithm which provides an approximation to the conditional distribution of the latent process in a systematic way, the method being based on the differential geometric approach to statistics, cf. Bain and Crisan (2009). When the space on which we project is a mixture family, as in Brigo (2011), the projection filter is equivalent to an approximate filtering via the Galerkin method, cf. Gunther et al. (1997). Following this idea, starting from Theorem LABEL:thm:simplifyingposteriorupdaterule, we obtain below in Theorem LABEL:thm:ProjFilterZakai the basic formula for our approximate filtering algorithm.
To be more specific, consider a set of probability density functions . Then, let be the space of all probability density functions that can be written as a probabilityweighted sum of . That is, if and only if for some probability vector on indices . In particular, for deriving our algorithms, we will assume that for some systematic choice of , each probability density under consideration is a member of .
Among many possible choices for in Theorem LABEL:thm:ProjFilterZakai are a multivariate Haar wavelet basis and a multivariate Daubechies basis. On the other hand, a Gaussian mixture model is pervasively used throughout statistical inference tasks such as clustering and classification in algorithms such as means clustering. As such, we develop our algorithms with an eye towards use with other Gaussian mixture modelbased algorithms. In Appendix LABEL:appendix:mixtureprojectionformula, we further develop our algorithm under the assumption that
where is the standard Gaussian density function defined on , , and the finite sequence is to be chosen judiciously prior to implementing the algorithm.
Preparing for our next result in Theorem LABEL:thm:ProjFilterZakai, we let be the symmetric matrix such that , and for each , let be the symmetric matrix such that its entry is . Collectively, we denote by the threeway tensor whose entry is . Let be the matrix such that its entry is , where is the differential operator such that
\hb@xt@.01(5.5) 
with
\hb@xt@.01(5.6)  
\hb@xt@.01(5.7) 
Theorem 5.3
Suppose that for each , and , . Let denote the column vector whose th entry is . Then,
\hb@xt@.01(5.8) 
5.3 Algorithm for continuous embeddings
5.3.1 Classical multidimensional scaling
In our application, our final analysis is completed by clustering the posterior distributions. Instead of working directly with posteriors, an infinitedimensional object, we propose to work with objects in an Euclidean space each of which represents a particular actor. However, given and , using their mean vectors or their KL distance for clustering can be uninformative. For example, if , then their mean vectors would be the same and their KL distance would be zero. However, if is the density of, say, a normal random vector such that its mean is zero and its covariance matrix is for a large value of , then concluding that actor and actor are similar could be misleading.
To alleviate such situations in a clustering step of our numerical experiments, we propose using a multivariate statistical technique called classical multidimensional scaling (CMDS) to obtain a lower dimensional representation of . More specifically, we achieve this by representing each actor as a point in , where the configuration is obtained by solving the optimization problem
\hb@xt@.01(5.9) 
where denotes a strictly decreasing function defined on taking values in . For example, one can take where is chosen so that for all possible values of . Another possibility among many others is to choose if is chosen so that for all possible values of .
We denote by the set of solutions to the optimization problem (LABEL:eqn:optsoln). Given a vector of probability densities on , it can be shown that the solution set is not empty and is closed under orthogonal transformations.
In the classical embedding literature, ensuring continuous embeddings is neglected as it is not relevant to their applications. However, for our work, this is crucial as we study their evolution through time, i.e., ideally, we would like to see that a small change in time corresponds to a small change in latent location. In this section, we propose an extension to the CMDS algorithm to remedy the aforementioned nonuniqueness issue, and show that the resulting algorithm ensures continuity of embeddings.
In our numerical experiments, for each , we choose a particular element of the solution set so that depends on in a consistent manner.
5.3.2 Continuous selection
By a dissimilarity matrix, we shall mean a real symmetric nonnegative matrix whose diagonal entries are all zeros. First, fix such that . Then, for each dissimilarity matrix , define
where . The elements of are known as classical multidimensional scalings, and as discussed in Borg and Groenen (2005), it is well known that is not empty provided that the rank of is at least . Our discussion in this section concerns making a selection from so that the map is continuous over the set of dissimilarity matrices such that is of rank at least .
Let be a dissimilarity matrix such that the rank of is at least . We begin by choosing an element of , say , through classical dimensional scaling. Let be the eigenvalue decomposition of , where and is the diagonal matrix whose entries are the eigenvalues in nonincreasing order. By the rank condition, we have . First, we formally write
\hb@xt@.01(5.10) 
where
 (i)

is the matrix with its entry ,
 (ii)

is the diagonal matrix whose th diagonal entry is .
Dependence of on will be suppressed in our notation unless needed for clarity. Now, if the diagonals of are distinct, then is well defined. However, in general, due to potential geometric multiplicity of an eigenvalue, our definition of can be ambiguous. This is the main challenge in making a continuous selection and we resolve this issue in our following discussion.
For our remaining discussion, without loss of generality, we may assume that for each dissimilarity matrix , is welldefined by making an arbitrary choice if there is more than one CMDS solution. Note that the mapping may not be a continuous selection. We now remedy this. First, fix an matrix and let
where runs over all real orthogonal matrices. Then, define
\hb@xt@.01(5.11) 
Algorithm LABEL:algo:CMDScc yields the solution and the proof of the following theorem, Theorem LABEL:thmstat:continuousembedding, can be found in Appendix LABEL:thmstat:contembedding:proof.
Theorem 5.4
Suppose that the matrix is of full column rank. Then, the mapping yields a welldefined continuous function on the space of dissimilarity matrices such that is of rank at least .
5.4 Technical observations
Here we discuss some insightful facts related to our model given the assumption stated in the last section. First, we have the following:
Theorem 5.5
Fix and suppose that for a.e. . The operator is elliptic, i.e., for each , the matrix is positive definite.
Proof. Note that for each ,
Note that and for each , and that away from a subspace of whose Lebesgue measure is zero. It follows that for each , whence is positive definite.
Now, we further examine Algorithm LABEL:algo:PopulationProcess, Algorithm LABEL:algo:LatentProcess, Algorithm LABEL:algo:MessagingActivities, Algorithm LABEL:algo:EstimateActorPosterior, Algorithm LABEL:algo:CMDScc, and discuss some technical points behinds these algorithms.
In Algorithm LABEL:algo:LatentProcess, existence and uniqueness of follows from Theorem LABEL:thm:elliptic. The continuity of follows from Theorem LABEL:thm:elliptic and Strook (2008). In Algorithm LABEL:algo:MessagingActivities, for simulating a sample path of , we use the socalled timechange property; that is, we use the fact that the process given by is a unitrate simple Poisson process, where and and . For simulation, we use its dual result, i.e., is a point process whose intensity process is , where is a path of a unitrate simple Poisson process. Also, we note that for computation of , online inference is a necessary part of our simulation in Algorithm LABEL:algo:MessagingActivities; that is, we need to compute In Algorithm LABEL:algo:MessagingActivities and Algorithm LABEL:algo:EstimateActorPosterior, nearinfinitesimallysmall means so small that the likelihood of having more than one event during a time interval of length is practically negligible. Also, by StandardNormalVector in Algorithm LABEL:algo:LatentProcess and UnitExponentialVariable in Algorithm LABEL:algo:MessagingActivities, we mean generating, respectively, a single normal random vector with its mean vector being zero and its covariance matrix being the identity matrix, and a single exponential random variable whose mean is one.
6 Simulation experiments
In our experiments, we hope to detect clusters with accuracy and speed similar to that possible if the latent positions were actually observed even though we use only information in estimated from information contained in . We denote the endtime for our simulation as . There are two simulation experiments presented in this section, and the computing environment used in each experiment is reported at the end of this section.
Experiment 1
We take and we assume that for each and actor , and . For the population process we take for each
where
Then, we also consider , where
There is only one population density; in other words, . Note that even with one population center, we can have more than one empirical mode for the subpopulation. One of these modes is near zero, and another mode is near one. The reason for this is that because of the value of and , when an actor is too far away from the mode of the population process, the population process affects the actors on its tail only by negligibly small amount. In Figure LABEL:fig:experimentonesixactorpath and Figure LABEL:fig:experimentoneperfectfilteringcase, a sample path of the true latent position of each of eight actors is illustrated in black lines. It is apparent that in the , all eight actors are equally informed of the population mode shift, but in the case, only the last three were able to adapt to the change, and the first five actors are surprised by the abrupt change at time .
Our simulation is discretized. Our unit time is , and in Figure LABEL:fig:experimentonesixactorpath, each tick in the horizontal axis corresponds to an integral multiple of . The jump term in our update formula is quite sensitive to the number of actors being considered. As such, for updating the jump term, we further discretized into subintervals for numerical stability of our update iterations. For , each unit interval is associated with subiterations, and the total number of the (main) iteration is , and we use