A Scalable MCEM Estimator for SpatioTemporal Autoregressive Models
Philipp Hunziker,^{1*} Julian Wucherpfennig,^{2} Aya Kachi,^{3} NilsChristian Bormann^{4}
1 Northeastern University
2 Hertie School of Governance
3 University of Basel
4 University of Exeter
* Corresponding author: hunzikp@gmail.com
Abstract
Very large spatiotemporal lattice data are becoming increasingly common across a variety of disciplines. However, estimating interdependence across space and time in large areal datasets remains challenging, as existing approaches are often (i) not scalable, (ii) designed for conditionally Gaussian outcome data, or (iii) are limited to crosssectional and univariate outcomes. This paper proposes an MCEM estimation strategy for a family of latentGaussian multivariate spatiotemporal models that addresses these issues. The proposed estimator is applicable to a wide range of nonGaussian outcomes, and implementations for binary and count outcomes are discussed explicitly. The methodology is illustrated on simulated data, as well as on weekly data of ISrelated events in Syrian districts.
1 Introduction
Over the last decade, there has been considerable growth in the availability of spatially indexed data. This is especially the case for eventtype data, which record the time and place of a specific type of occurrence (i.e. an ‘event’), and have become increasingly common across a range of subjects, including political violence, crime, epidemiology, traffic, geotagged social media data, cellphone activity, and more.
A common and natural approach for analyzing such data is to aggregate them onto a spatial lattice, that is, to map the events onto distinct tiles fully covering the study area.^{1}^{1}1An alternative approach which we do not address in this paper is to treat eventtype data as a spatial point process, see e.g. Baddeley, Rubak and Turner 2015. In lattice data, it is often of interest to estimate the presence and magnitude of spatial (inter)dependence, i.e. the degree to which the outcome of unit affects the outcomes of its neighbors and viceversa. Modeling such spatial interdependence is at the core of much of the spatial econometrics literature, with the spatial autoregressive (SAR) model being the workhorse tool for that purpose (see LeSage and Pace 2009; Anselin 2010). In its simplest and most widely applied form, the SAR model is designed for a conditionally normal outcome variable observed over a crosssection. However, empirical applications rarely meet this requirement. For one, many outcomes of interest are discrete. Importantly, this is nearalways the case if the outcome is constructed by aggregating eventtype data onto a lattice, which will typically result in a count or binary outcome. Moreover, as recording spatially indexed data becomes easier, researchers increasingly have access to timevariant data. In this setting, it is often desirable to structure the data as a panel, and to model temporal and spatial dependence simultaneously. And finally, many interesting problems involve more than one outcome, and researchers may be interested in studying their interaction while taking into account spatial dependence.
Extending the simple SAR model to address these complications is nontrivial. Perhaps most importantly, the standard estimation strategy for the normal SAR model – maximum likelihood estimation (MLE) – does not travel well to settings with nonnormal outcomes because the respective loglikelihood functions involve a highdimensional integrals that cannot be evaluated directly. In addition, implementing an estimation strategy for the SAR model that scales well with large datasets is challenging, regardless of the outcome distribution. Naive implementations often rely on conceptually simple but computationally costly matrix operations which make estimating large models impractical (see e.g. Bivand, Hauke and Kossowski 2013).
This paper addresses these challenges by proposing a Monte Carlo Expectation Maximization (MCEM) estimation strategy for a class of multivariate spatiotemporal autoregressive models for nonGaussian outcome variables. Specifically, we introduce an MCEM estimator for a family of models where a multivariate spatiotemporal autoregressive process operates on a latent variable with a conditionally Gaussian distribution; the observed outcomes are then assumed to be iid distributed conditional on the (transformed) latent variable. This setup permits the researcher to freely choose an appropriate outcome distribution for the type of data being modeled, such as a Poisson distribution for count data.
The proposed estimator features a number of benefits. First, it is applicable to a large class of nonGaussian outcomes. We discuss and provide implementations for binary and count data, but extending the estimator to other types of outcomes (e.g. censored or multinomial) is straightforward. Second, we take appropriate steps to ensure that our estimator is scalable, and thus applicable to large datasets. For univariate outcomes and settings where each unit has a constant number of spatial neighbors, our proposed implementation of the expected loglikelihood function has linear time complexity in the number of units and timeperiods. For models with multivariate outcomes, our implementation has approximately quadratic time complexity in the number of units, and linear time complexity in the number of periods. Finally, our MCEM estimator combines the benefits of a purely Bayesian (estimated e.g. via MCMC) and a purely frequentist (estimated e.g. via MLE or GMM) approach. Specifically, relying on Monte Carlo sampling during the step of the algorithm enables us to seamlessly impute missing values in the outcome in a fully Bayesian manner. This also permits generating outofsample predictions during model training. Simultaneously, the step of the algorithm ensures that our estimator converges much more quickly – and with less need for tuning – than a fully Bayesian approach.
This paper adds to the quickly growing literature that derives scalable estimators for modeling spatial dependence in nonGaussian outcomes, specifically Wang, Iglesias and Wooldridge (2013), Pace and LeSage (2016), and Liesenfeld, Richard and Vogler (2016). Apart from introducing a new estimation strategy, this paper supplements these previous efforts in at least three aspects. First, we provide a model that is suitable not only for spatiotemporal data, but also for multivariate outcomes. Second, our approach provides an efficient and theoretically satisfying solution to missing outcome data and outofsample prediction tasks. And third, we provide an efficient, tested ’blackbox’ implementation of the proposed estimator in the R programming language, making it easily accessible for applied researchers.^{2}^{2}2See https://github.com/hunzikp/dimstar.
The paper proceeds by providing an overview of the related literature. In Section 3, we then introduce the family of models we are interested in, followed by a detailed discussion of our MCEM estimation strategy in Section 4. Next, Section 5 evaluates our estimator on simulated data, showing favorable performance in terms of finite sample properties and scalability. Section 6 illustrates the methodology on observed data, presenting an analysis of weekly ISrelated incidences of political violence in Syrian districts.
2 Related Literature
The literature on spatial autoregressive models for nonGaussian outcomes is largely divided along outcome types. For one, a considerable literature discusses SAR models for binary data, in particular crosssectional Spatial Probit models, where the spatial autoregressive process operates on a latent Gaussian variable and is mapped onto the binary outcome via the typical Probitstyle indicator function (see LeSage and Pace 2009 for a textbook treatment). Perhaps the bestknown estimator for the Spatial Probit model is introduced by LeSage (2000), who derives a MetropolisHastingswithinGibbs sampling approach for estimating a fully Bayesian variant of the model for crosssectional data. Taking an alternative approach, Beron and Vijverberg (2004) propose a maximum simulated likelihood estimation strategy where the likelihood is approximated via recursive importance sampling (RIS). Franzese, Hays and Cook (2016) generalize the RIS approach to settings with spatial and temporal dependence. Yet another estimation strategy is pursued by Pinkse and Slade (1998), who derive a GMM estimator for the Spatial Probit model. This approach is extended by Klier and McMillen (2008), who introduce a scalable but inconsistent linearized version of Pinkse and Slade’s 1998 estimator.^{3}^{3}3More precisely, the Klier and McMillen linearized estimator produces inconsistent estimates for the spatial autoregressive parameter if absolute spatial dependence is high; see also the results reported by Calabrese and Elkink (2014).
Finally, an early approach that deserves special attention in the context of this paper is the estimator proposed by McMillen (1992). Similar to the present paper, McMillen considers the EM algorithm to perform maximum likelihood estimation, though he limits his focus to univariate crosssectional binary outcome data. Within these constraints, he derives a closedform EM estimator, that is, unlike the approach put forward in this paper, there is no need for Monte Carlo simulation. However, this comes at the cost of consistency. Specifically, McMillen makes a number of simplifying assumptions such that the final EM estimator does not maximize a valid likelihood. As a result, the McMillen estimator produces biased estimates, as shown in the simulation experiments of Calabrese and Elkink (2014).
Because the estimation approaches discussed so far do not generally scale well to large datasets, recent efforts in the literature have focused on deriving crosssectional Spatial Probit estimators with low computational overhead. Wang, Iglesias and Wooldridge (2013) propose a partial maximum likelihood estimator that simplifies the estimation problem by grouping observations into pairs. Similarly, Pace and LeSage (2016) improve scalability by proposing a variant of the RIS approach by Beron and Vijverberg (2004) that leverages sparse matrix methods.
Another class of nonGaussian spatial autoregressive models having received attention in the literature are SAR models for count data. In an excellent review, Glaser (2017), following the timeseries terminology of Cox et al. (1981), divides SAR count data models into two types: ’Observationdriven’ models where spatial autocorrelation operates on the observed count outcome, and ’parameterdriven’ models, where spatial autocorrelation operates on a latent variable. The estimator introduced in the present paper falls into the latter category.
Lambert, Brown and Florax (2010) introduce an observationdriven model for crosssectional data where the spatial autoregressive process operates on the logconditionalmean of a Poisson outcome. They propose an estimation strategy based on twostep limited information maximum likelihood. Conceptually similar models are introduced by Hays and Franzese (2009) and Bhati (2008). Apart from their focus on crosssectional data, these models only account for spatial dependence induced by the predictors, but not by the error terms (Glaser, 2017, p.9).
Among parameterdriven SAR count models, two deserve special attention.^{4}^{4}4See Glaser (2017) for a more complete overview of the literature. The first is the crosssectional model considered by Bhat, Paleti and Singh (2014), which, similar to the family of models considered in this paper, assumes a latentGaussian structure. Moreover, Bhat, Paleti and Singh (2014) also explicitly consider multivariate outcomes. However, in contrast to the approach pursued in the present paper, their formulation restricts outcome dependence to the error terms, whereas we allow for fully interdependent outcomes.
The second SAR count model worth highlighting is the one considered by Liesenfeld, Richard and Vogler (2016, 2017), who also focus on the latentGaussian setup. Moreover, similar to the present paper, Liesenfeld, Richard and Vogler (2016) do not focus exclusively on a single outcome type, but generalize their analysis to a wide range of outcome distributions, discussing count and binary outcomes specifically. Their key contribution is a new maximum simulated likelihood estimation procedure that extends the efficient importance sampling (EIS) algorithm by Richard and Zhang (2007) in a manner that makes it applicable to largescale spatial models. Moreover, in Liesenfeld, Richard and Vogler (2017), the authors generalize the count outcome version of their model to a panel setup, and consider temporal and spatial dependence.
In sum, despite being a very active strand of research, the literature on SAR type models for nonGaussian outcome data has only recently been able to catch up with the explosive growth in spatiotemporal data and the modeling challenges that accompany it. In particular, applied researchers currently have few choices for modeling spatiotemporal dependence in large, possibly multivariate nonGaussian response data. Most of the existing literature only considers univariate and/or crosssectional data, and pays limited attention to scalability. Perhaps more importantly, many of the estimation techniques proposed in the literature lack an easily accessible software implementation, thus further raising the bar for practitioners to employ these methods.
This paper addresses these shortcomings headon. By introducing a flexible and readily implemented MCEM algorithm for estimating spatiotemporal dependence in possibly multivariate outcome data, it complements and extends the recent efforts by Pace and LeSage (2016) and Liesenfeld, Richard and Vogler (2016).
3 The latentGaussian multivariate spatiotemporal autoregressive model
We propose a latentGaussian formulation for modeling multivariate spatiotemporal data over outcomes, units, and periods. Let be a column vector representing the th outcome at period . Then the class of models we consider is summarized by the following assumptions:
whereas is a rowstandardized spatial weights matrix, and , , and are parameters, each constrained to the interval, capturing spatial, temporal, and outcomedependence, respectively. To ensure identification, we also assume that . Hence, in a system with outcomes we estimate parameters. One may also add a (outcomespecific) matrix of predictors to to the above model (); we omit the term for ease of exposition.
While the abovenoted model specification is intuitive, it is often more practical to work with a formulation that omits outcome and periodwise subscripts, and is specified in reducedform. To this end, we concatenate the latentGaussian vectors first across outcomes, then across periods, yielding the following formulation:
(1)  
is a interdependence matrix taking the form
whereas
and
denotes the multivariate normal distribution, and
with
denotes the identity matrix. Note that if , the proposed model simplifies to the regular spatiotemporal autoregressive model (STAR) defined on the latentGaussian outcome . The proposed model is stationary if , which we ensure by requiring
whereas represents the sum over all possible pairs of outcomes involving outcome .
We link the latentGaussian outcome to the observed outcome by specifying an appropriate distribution . If the observed outcome is binary we propose the deterministic distribution function
which, together with the identifying assumption , amounts to a multivariate spatiotemporal Probit specification.
If the observed outcome is a count variable, we propose using a multivariate Poisson logNormal setup, i.e.
whereas dPois is the Poisson PMF.
4 Estimation
Let denote a vector of all parameters (). The fulldata loglikelihood of the model is then given by
(2) 
The log probability of the latentGaussian variable , in turn, is induced by the model specification in (1), and is given by
(3) 
whereas dMVN denotes the multivariate normal density.
MCEM algorithm
The Expectation Maximization (EM; Dempster, Laird and Rubin 1977) formalism estimates by maximizing the intractable marginal likelihood
via coordinate ascent, giving the following recursive algorithm:

Determine the expected value of the fulldata loglikelihood with respect to the latent variable and conditional on the current parameters . This expectation is typically called the function:

Obtain the updated set of parameters by maximizing the function:

Repeat until convergence.
In the following, we derive the function up to a constant. For ease of exposition, we drop all subscripts from the expectation operator.
Evaluating the expected kernel requires the first two moments of , which are generally not available in closed form. For this reason, we pursue an MCEM approach where we obtain an estimate of the kernel via sampling. One way to do this would be to (i) sample from , (ii) use these samples to estimate the first two moments of and then (iii) plugging these estimates into the function. However, this approach requires estimating the covariance matrix , which involves considerable computational overhead. Instead, we invoke the law of the unconscious statistician and estimate the kernel directly via the numerical integral
MC sampling
The MCEM algorithm requires sampling from
Sampling from this distribution directly is typically either unfeasible (e.g. in the countdata case) or prohibitively costly (e.g. in the binary case), which is why we propose using a Gibbs sampler to sample from
(4) 
whereas subscript enumerates a single data point, and denotes the entire vector except for . Note that the second term in expression (4) is a conditional multivariate normal distribution, and thus itself normal:
where , i.e. is the precision matrix of , as established in (3). Note that a fortunate implication of these derivations is that we do not have to invert to evaluate . We do have to invert , but because is diagonal, this may be performed in linear time. Similarly, is scalar, so obtaining its inverse is straightforward as well.
If the outcome is binary, sampling from the conditional distribution (4) thus amounts to sampling from the univariate truncated normal distribution,
where denotes the truncated Normal density with support in the interval .
If the outcome is a countvariable, sampling from the conditional distribution (4) requires sampling from the unnormalized Poisson logNormal posterior
To do so, we employ the adaptive rejection sampler by Wild and Gilks (1993), which allows highly efficient sapling from unnormalized densities as long as (i) the first derivative with respect to the sampling variable is available, and (ii) the density is logconcave. For the present application, it is trivial to show that these requirements are met. The logarithm of the Poisson logNormal posterior is proportional to
with first and second derivatives
Finally, note that using a Gibbs sampler to sample the latent variable provides a natural strategy for imputing missing data. Specifically, if is missing, we simply draw from its conditional prior,
during the step of the MCEM algorithm. This yields a complete set of samples, allowing us to proceed with the step without further modifications. This approach is particularly useful for generating outofsample predictions or forecasts. We simply add the observations to be predicted to the dataset on which the model is being trained, and set corresponding outcome observations to missing. The MCEM algorithm will then produce corresponding samples from which predictions can be obtained.
Efficient evaluation of the Q function
A central part of the MCEM algorithm is the maximization of the function, given by
Because the function is evaluated repeatedly during optimization, scalability requires that can be evaluated efficiently even for very large datasets. In the following, we briefly discuss our computational setup for evaluating the three constituent terms of the function as efficiently as possible.
Expected kernel
We ensure efficient evaluation of the approximate expected kernel by storing as a sparse matrix, and using sparse matrix multiplication for evaluating each kernel sample. This yields a complexity that is linear in , whereas is the number of nonzero entries in .^{5}^{5}5Because is diagonal, the matrix product with has complexity . If is such that each unit has spatial neighbors, then
Hence, assuming constant as increases (which is typically the case in lattice data), then the complexity of evaluating the approximate expected kernel is . In other words, it is linear in the number of units, timeperiods, and MC samples, and quadratic in the number of outcomes.
Logdeterminant of error variance
Recall that the error covariance matrix, , is assumed diagonal. It follows that
which has complexity .
Logdeterminant of
The biggest impediment towards an efficient implementation of the function is the computation of the logdeterminant of the spatiotemporal multiplier, . Standard algorithms for computing the determinant of arbitrary martrices have cubic complexity, which is clearly prohibitive when is large. There is an extensive literature discussing this issue for the simple spatial autoregressive case where , see e.g. Bivand, Hauke and Kossowski (2013). In our case, however, also contains temporal and outcomespecific lags, making many of the standard approaches for evaluating the logdeterminant – e.g. Ord’s (1975) method – impractical.
We address this issue by showing that, thanks to the particular structure of , it is actually not necessary to compute the logdeterminant of directly. More precisely, in the following, we introduce what we term the STAR LogDeterminant theorem, which establishes that computing the logdeterminant of the matrix only requires numerical evaluation of the determinant of a smaller matrix of size .
We start by stating the following results:
Lemma 1.
The inverse of a blockdiagonal matrix is the blockdiagonal matrix of block inverses. E.g.
Lemma 2.
The left or right matrix product of a blockdiagonal matrix with a matrix that is lowerblockdiagonal with respect to is itself lowerblocktriangular with respect to . E.g. let
Then
and equivalently for .
Lemma 3.
From Sylvester’s identity (see e.g. Akritas, Akritas and Malaschonok, 1996) it follows that if is an invertible matrix and is a matrix, then
We now posit the following:
Theorem 1 (STAR LogDeterminant).
Let be the inverse spatiotemporal multiplier as defined in (1). We may decompose as follows,
with
with and both of size . Then the logdeterminant of is given by
Proof.
Invoking Lemma 3 we can write
From Lemmas 1 and 2 we can derive that is lowerblockdiagonal with respect to . Consequently, is lowertriangular with a unit diagonal. Because the determinant of triangular matrices is the product of all diagonal entries, it follows that
is blockdiagonal with blocks of . Since the determinant of a blockdiagonal matrix is the product of the block determinants, we have
∎
Given the STAR LogDeterminant Theorem the problem of evaluating reduces to finding an efficient way to evaluate . We note that if (i.e. only a single outcome is modeled), then , and thus any of the various methods proposed for evaluating the logJacobian of the spatial autoregressive model may be used (see Bivand, Hauke and Kossowski 2013). One popular strategy for this setting is Ord’s method (1975). For cases where is large and Ord’s method is impractical, Smirnov and Anselin (2009) propose a method that has complexity . In our implementation of the MCEM estimator, we employ the same strategy as Wilhelm and de Matos (2013) and precompute for a range of values prior to optimization using a sparse decomposition, and simply look up the value of the logdeterminant during optimization.
If , we propose using a sparse Cholesky decomposition for computing , similar to the approach proposed by Pace and Barry (1997) for the regular spatial autoregressive case. Using the sparse Cholesky decomposition raises a problem: The Cholesky decomposition is only applicable to symmetric, positivedefinite matrices, but there is no guarantee of meeting either of these requirements. Indeed, if is rowstandardized, then will not be symmetric. We address this problem by relying on the following ‘trick’ suggested by Pace and Barry (1997):
whereas is both positivedefinite and symmetric.
The computational complexity of the sparse Cholesky decomposition is not well defined, and depends on the fillpattern of the matrix to be decomposed. However, we show in the simulations below that if is built from a regular lattice grid and the number of neighbors per unit remains constant, then the complexity is approximately .
Optimization
At each MCEM iteration, we need to solve the following constrained optimization problem:
—l— θQ(θ— θ^m) \addConstraint—∑(j,k) λj,k + ρj + γj—¡1 \addConstraintσ^2_j¿0 \addConstraint—ρ_j—¡1 \addConstraint—γ_j—¡1 \addConstraint—λ_j,k—¡1, ∀ j ∈G ∧ pairs(j,k) ∈G.
We enforce the constraints on the scalar dependence parameters (, , ) by optimizing with respect to the inverse hyperbolic tangent (e.g. ) of the parameters, and mapping these transformed values onto the interval during the evaluation of the function. Analogously, we optimize with respect to the inversesoftplus transformed variance parameters.^{6}^{6}6The softplus function (also known as the smooth linear rectifier) is given by . The stationarity constraint we enforce by employing a constrained optimization algorithm, namely the general nonlinear augmented Lagrange multiplier method solver by Ye (1988).
Standard Errors
We estimate standard errors for using an MC approximation to Louis’ (1982) method, similarly to Natarajan, McCulloch and Kiefer (2000). Louis gives the following identity for the observed Fisher information matrix in the EM framework,
whereas and are the Hessian and gradient of the fulldata loglikelihood function (2), respectively. We obtain estimates of the expectation (and, respectively, variance) of these quantities via simulation and numerical differentiation.
5 Simulation Results
Bias, RMSE, and Coverage
The goal of this first simulation study is to evaluate the finite sample properties of our MCEM estimator, and to validate our implementation of the underlying algorithms.
We generate simulated data from the following specification,
whereas each consists of a unit constant, and a single covariate drawn from . The true parameters are held constant at and . The spatial weights matrix is constructed from a regular square grid with side length . We perform two sets of experiments, one with a small sample size of , and one with a larger sample size of . Each experiment is repeated 50 times.^{7}^{7}7All experiments reported in this subsection involve count outcomes. Results for a set of equivalent experiments with binary outcome data are available in the Appendix, with largely identical findings.
For each simulated dataset, we use our MCEM estimator to recover point and standard error estimates for the full parameter vector . We use a MC sample size of for the MCEM algorithm, and a sample size of for estimating the Fisher Information matrix. We run the MCEM algorithm for 50 iterations, unless the maximum absolute change across all estimated parameters is smaller than in any previous iteration.
Figure 1 summarizes the distribution of the parameterwise estimation error () over all simulation draws, for both sets of experiments. Clearly, the MCEM estimator is able to recover the true parameter values, and the decrease in bias and RMSE (root mean square error) with larger sample size suggests consistency. We note that the recovered estimates have low RMSE (compared to the true parameter values) even though the models were estimated with a relatively small MC sample size. Increasing the MC sample size would likely yield a further reduction in RMSE.
Next, Figure 2 shows the coverage probability of the 90% confidence intervals constructed from the estimated standard errors. Again, we observe that the standard error estimates appear to be unbiased; in no instance can we reject the null that the true coverage probability is 90%.
Time Complexity of Expected Likelihood Evaluation
The most important step towards a scalable implementation of proposed estimator is ensuring that the evaluation of the expected loglikelihood (or function) in the step of the MCEM algorithm is as efficient as possible. In the following set of simulation experiments, we evaluate how well our implementation of the expected loglikelihood scales in the number of units (), time periods () and outcomes (). This exercise serves two purposes: first, to validate the analytical complexity estimates derived in Section 4, and second, to obtain empirical complexity estimates for datasets where . Recall that in the latter case, we were unable to derive complexity analytically because of the sparse Cholesky decomposition involved.
For all experiments, we sample instances of the latentGaussian using the same data generating process, true parameter values, and spatial weights matrix as discussed in the previous section. Each experiment consists of evaluating the expected loglikelihood five times at the true parameter values; we report the total CPU time accumulated during those five evaluations. We repeat the experiment for different dataset sizes, varying , and separately.
Figure 3 summarizes the results. Panel 2(a) shows CPU time as a function of , for and . As expected, performance is linear in for the univariate outcome case (), except for very small . When , we find that performance is approximately quadratic. As discussed extensively in Section 4, this is because if , we need to perform a sparse Cholesky decomposition of a matrix of size to evaluate . While the complexity of this operation is problem dependent, the results presented here suggest that it is much more efficient than the regular (dense) Cholesky decomposition, which has cubic complexity.
Next, Panel 2(b) shows CPU time as a function of , with and . As expected, we find that performance is linear in , even for . The latter property is thanks to the STAR LogDeterminant theorem, which permits efficient evaluation of regardless of .
Finally, Panel 2(c) shows changes in CPU time as we vary , with and . As anticipated in Section 4, we find that performance is quadratic in .
In summary, the cost of evaluating the expected loglikelihood is linear in the number of units and timeperiods if the outcome is univariate, and roughly quadratic in the number of units if the outcome is multivariate. Also, evaluating the expected loglikelihood becomes quadratically more expensive as the number of modeled outcomes increases. While these findings do not translate into total runtimes for the full MCEM algorithm directly (which depends on the problem at hand and the convergence criterion employed), they demonstrate that our estimator is far more efficient than a ‘naive‘ approach that relies on direct evaluation of the logdeterminant and dense matrix algebra.
Spatial Probit Comparison
In a final set of simulation experiments we compare our MCEM estimator against two alternative approaches for estimating a Spatial Probit model, i.e. a setup where the outcome is binary and the data univariate and crosssectional (i.e. ). We focus on this setting because it allows us to compare our estimator against a pair of readily implemented and widely used alternatives. The first is the Bayesian Spatial Probit model, estimated via MCMC, as proposed by LeSage (2000) and implemented in the R programming language by Wilhelm and de Matos (2013). The second is the GMM estimator for Spatial Probit models, introduced by Pinkse and Slade (1998), and improved and implemented in R by Klier and McMillen (2008).
We perform two sets of experiments. In the first set, we draw data from a Spatial Probit data generating process with a design matrix composed of a constant and a single predictor , and . The spatial weights matrix is constructed from a regular square grid with side length . We vary the sample size (), as well as the magnitude of spatial autocorrelation (). We repeat each experiment 30 times.
For each simulated dataset, we use our MCEM estimator to recover point estimates for and . We use a MC sample size of , and run the MCEM algorithm for 75 iterations, unless the maximum absolute change across all estimated parameters is smaller than in any previous iteration. Correspondingly, we also recover point estimates using the MCMC and the GMM estimator. For MCMC sampling we rely on the default settings provided in the R package by Wilhelm and de Matos (2013).
To establish whether the estimators are able to accurately recover the true parameters, we evaluate the distribution of the estimation errors and . Figure 4 plots the corresponding results. We find that (i) all three methods are unbiased and have comparable RMSE if is sufficiently high, (ii) our estimator exhibits the smallest bias/RMSE in if is small, and (iii) GMM estimation fails for large and .
In a second set of experiments, we simulate data from the same data generating process, but fix and only vary . The goal here is to evaluate how well the three estimators scale in terms of estimation time. The results of these experiments are reported in Figure 5. We find that estimation time (i) increases cubically for the implementation, (ii) increases superlinearly for the MCMC implementation, and (iii) – as anticipated – increases linearly for our estimator.
6 Application: IS Violence in Syria
In this section, we apply the proposed estimator to a spatiotemporal model of Islamic State (IS)related violence in Syrian districts between January 2017 and June 2018. In particular, we investigate whether the occurrence of airstrikes targeting IS forces serves as an advance signal of ISrelated violence in the following week. Methodologically, the application demonstrates how accounting for spatiotemporal dependence may affect substantive conclusions, the use of elasticities for interpretation, and the use of outofsample predictions to evaluate model fit.
The application speaks to a growing number of studies in Political Science and Economics that analyze political violence using eventlevel data (see Schrodt 2012 for a review). It is widely acknowledged that political violence exhibits considerable spatial and temporal dependence, as contested areas often shift and expand over the course of a conflict (e.g. Weidmann and Ward, 2010). However, many analyses of political violence account for spatial and temporal dependence only via standard error adjustments, not least because appropriate modeling techniques for nonGaussian outcomes have long been unavailable (e.g. Pierskalla and Hollenbach, 2013; Buhaug et al., 2011). Moreover, the literature suggests that there is considerable interdependence between different forms of political violence, especially during civil wars, where violence against civilians and battlefield violence between combatants are closely linked (Kalyvas, 2006). By using our estimator, we are able to fit a model that accounts for spatiotemporal dependence, and is able to capture interdependence between violence against civilians and battlefield violence directly.
We analyze a panel dataset containing weekly observations for all 68 Syrian administrative districts for the 78 calendar weeks between January 8th 2017 and June 23rd 2018, totaling 5168 observations. We study two count outcomes: (1) The number of battleevents involving IS forces per districtweek, and (2) the number of instances where IS forces perpetrated violence against unarmed civilians per districtweek. Data for both variables were constructed from the ACLED dataset (Raleigh et al., 2010), which provides eventlevel data of political violence in Syria beginning in 2017. We include a total of five predictors. The first four are timeinvariant geographic/demographic variables that are known to correlate with the location of political violence (Tollefsen and Buhaug, 2015): The area of a district; its population in 2000; the average travel time to the next city in hours, averaged over all locations in a district; and the proportion of the district area that is covered by mountains. All variables but the population count originate from the PRIOGrid dataset (Tollefsen, Strand and Buhaug, 2012). The population data are obtained from CIESIN (2010). The fifth predictor is a lagged count of the number of airstrikes targeted at IS positions in a given district. This is the sole timevariant predictor, and is included under the hypothesis that airstrikes may reveal information about where future fighting is going to occur. This variable is also constructed from the ACLED dataset. All predictors except for the proportion of mountainous terrain are logged.^{8}^{8}8We add a unit constant to the airstrikes variable prior to taking the natural logarithm. Figure 6 summarizes the spatial and temporal distribution of the two outcomes, as well as the airstrikes predictor.
We estimate three models. First, a latentGaussian count model without spatial or temporal dependence parameters, estimated for both outcomes separately (Model 1). Second, a latentGaussian count model with spatiotemporal dependence parameters, again estimated separately for each outcome (Model 2). And third, a multivariate latentGaussian count model with spatiotemporal dependence parameters, estimated for both outcomes jointly, and including a parameter for capturing outcome dependence (Model 3). All models are estimated using our MCEM estimator, using 50 MC samples for parameter estimation and 100 MC samples for standard error estimation. Each model is run for 15 iterations, and convergence is ensured by checking the largest absolute change across all parameter estimates after each iteration, ensuring that the series is stationary.
For each model and outcome, we report point and standard error estimates, direct elasticities, and spillover elasticities. Direct elasticities indicate the increase in the expected outcome for unit , in percent, as a result of a one percent increase in the respective predictor for unit . For Model 1 the direct elasticities are equivalent to the point estimates.^{9}^{9}9Note that for the Mountains predictor the reported value is actually a semielasticity because the variable is not logged. For Models 2 and 3 the reported direct elasticities are averaged over all observations, as nonzero spatial or outcome dependence leads to unitspecific elasticities due to the spatial/outcomespecific feedback effect. Finally, the spillover elasticities indicate the relative increase in the sum of expected outcomes over all units in period , in percent, as a result of increasing the predictor of unit in period by 1 percent. Naturally, all spillover elasticities are zero for Model 1.
Model 1  Model 2  Model 3  
Coef.  Direct Elast.  Spillover  Coef.  Direct Elast.  Spillover  Coef.  Direct Elast.  Spillover  
Battles  
Airstrikes (log, lag)  
Area (log)  
Population (log)  
Remoteness (log)  
Mountains  
Violence Against Civilians  
Airstrikes (log, lag)  
Area (log)  
Population (log)  
Remoteness (log)  
Mountains  
Obs.  5168  5168  5168  
* , ** , ***  
Note: Standard errors in parentheses. All elasticities are averaged over units and timeperiods. 
Model 1  Model 2  Model 3  
RMSE  MAE  RMSE  MAE  RMSE  MAE  
Battles  
Viol. agst. Civilians  
* Smallest value per outcome. 
Table 1 summarizes the results. Three findings are worth highlighting. First, airstrikes appear to be a useful predictor of both battle events as well as violence against civilians. However, the effect size decreases drastically once we take spatiotemporal dependence into account, namely from a direct elasticity of 1.86/0.95 to 0.18/0.41. This is likely because Model 1 falsely attributes temporal autocorrelation in the outcomes to the airstrikes variable. Still, the predictive power of airstrikes is substantial. For instance, a doubling of the number of airstrikes observed in a given district is expected to increase the number of events involving violence against civilians in the same district by between 32% (Model 2) and 41% (Model 3) in the following week. Second, we find little evidence of spatial dependence in battles, but some evidence for spatial autocorrelation for the violence against civilians outcome. For instance, a doubling of the number of airstrikes in a district leads to between 8.6% (Model 2) and 5.9% (Model 3) more instances of violence against civilians across all other districts in the following week. Third, surprisingly, we find only very moderate dependence between the two outcomes, with the estimate in Model 3 statistically significant, but close to zero.
Finally, we compare the models’ goodness of fit via their outofsample predictive power. Specifically, we reestimate all models while censoring the outcome value for an arbitrary 33% of all observations. Thanks to the MC step of our MCEM estimator, predictions for the omitted outcomes are generated automatically as a sideproduct of the estimation procedure. Table 2 reports root mean squared prediction error (RMSE) and mean absolute prediction error (MAE) for all models. Clearly, accounting for spatiotemporal dependence leads to much better model fit, with both error metrics being smaller in Models 2 and 3 than in Model 1. Moreover, the multivariate model (Model 3) yields a slightly better fit for the violenceagainstcivilians outcome, but not for the battle outcome.
7 Concluding Remarks
The goal of the work reported here was to introduce a scalable method for estimating models of spatiotemporal dependence in nonGaussian and potentially multivariate outcomes. We addressed this task by proposing a Monte Carlo Expectation Maximization (MCEM) procedure for estimating the parameters of a family of latentGaussian spatiotemporal autoregressive models, discussing the cases of binary and count outcomes explicitly. Relying on an MCEM approach avoids having to evaluate the intractable likelihood of the latent Gaussian spatiotemporal model, and offers a straightforward strategy for generating outofsample predictions during model training.
We confirmed through simulation experiments that our estimator yields consistent parameter and standard error estimates in finite samples. We also showed that our proposed estimator scales well with large datasets, both analytically and in simulations. In the case of a univariate outcome and a fixed number of spatial neighbors per unit, our implementation of the expected loglikelihood function has linear time complexity in the number of units and time periods. For multivariate outcomes, the time complexity is approximately quadratic in the number of units. Further, the simulation experiments in Section 5 highlighted that our estimator compares favorably against other options for estimating the Spatial Probit model, both in terms of error and estimation time.
Finally, we illustrated the usefulness of our estimator by examining weekly counts of ISrelated violence in Syria. We demonstrated the use of elasticities for interpretation, and how outofsample predictions may be employed to assess model fit. Substantively, we found that airstrikes may serve as a useful advance signal for predicting the occurrence of civilwar related violence.
The work presented here may be usefully extended in several directions. While we limit our discussion to binary and conditionally Poisson distributed count outcome data, there are various other outcome distributions that may be of interest. In particular, using the Negative Binomial may be necessary for count outcomes that are highly overdispersed, and thus inadequately modeled by the Poisson logNormal setup discussed in this paper (see also Liesenfeld, Richard and Vogler, 2016). Another potentially fruitful avenue for further research may be extending our MCEM procedure such that the MC sample size is determined dynamically during estimation, e.g. using the method introduced by Caffo, Wolfgang and Jones (2015). This would likely lead to estimates exhibiting less variance, potentially speed up estimation, and provide a more appropriate strategy for assessing convergence.
Acknowledgments
We would like to thank Frederick Boehmke, LarsErik Cederman, Robert Franzese Jr., Dennis Quinn, Andrea Ruggeri, Emily Schilling, Martin Steinwand, Oliver Westerwinter, Michael Ward, Christopher Zorn, and Bruce Desmarais for their useful comments on previous versions of this paper.
References
 (1)
 Akritas, Akritas and Malaschonok (1996) Akritas, Alkiviadis G, Evgenia K Akritas and Genadii I Malaschonok. 1996. “Various proofs of Sylvester’s (determinant) identity.” Mathematics and Computers in Simulation 42(46):585–593.
 Anselin (2010) Anselin, Luc. 2010. “Thirty years of spatial econometrics.” Papers in Regional Science 89(1):3–25.
 Baddeley, Rubak and Turner (2015) Baddeley, Adrian, Ege Rubak and Rolf Turner. 2015. Spatial point patterns: methodology and applications with R. CRC Press.
 Beron and Vijverberg (2004) Beron, Kurt J and Wim P M Vijverberg. 2004. Probit in a Spatial Context: A Monte Carlo Analysis. In Advances in Spatial Econometrics: Methodology, Tools and Applications, ed. Luc Anselin, Raymond J G M Florax and Sergio J Rey. Berlin: Springer pp. 169–195.
 Bhat, Paleti and Singh (2014) Bhat, Chandra R, Rajesh Paleti and Palvinder Singh. 2014. “A spatial multivariate count model for firm location decisions.” Journal of Regional Science 54(3):462–502.
 Bhati (2008) Bhati, Avinash Singh. 2008. “A generalized crossentropy approach for modeling spatially correlated counts.” Econometric Reviews 27(46):574–595.
 Bivand, Hauke and Kossowski (2013) Bivand, Roger, Jan Hauke and Tomasz Kossowski. 2013. “Computing the Jacobian in Gaussian spatial autoregressive models: an illustrated comparison of available methods.” Geographical Analysis 45(2):150–179.
 Buhaug et al. (2011) Buhaug, Halvard, Kristian Skrede Gleditsch, Helge Holtermann, Gudrun Østby and Andreas Forø Tollefsen. 2011. “It’s the local economy, stupid! Geographic wealth dispersion and conflict outbreak location.” Journal of Conflict Resolution 55(5):814–840.
 Caffo, Wolfgang and Jones (2015) Caffo, Brian S., Jank Wolfgang and Galin L. Jones. 2015. “Ascentbased Monte Carlo expectation– maximization.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2):235–251.
 Calabrese and Elkink (2014) Calabrese, Raffaella and Johan A Elkink. 2014. “Estimators of binary spatial autoregressive models: A Monte Carlo study.” Journal of Regional Science 54(4):664–687.
 CIESIN (2010) CIESIN. 2010. “Global RuralUrban Mapping Project (GRUMP), v1.” Online: http://sedac.ciesin.columbia.edu/data (accessed 06/10/2018) .
 Cox et al. (1981) Cox, David R, Gudmundur Gudmundsson, Georg Lindgren, Lennart Bondesson, Erik Harsaae, Petter Laake, Katarina Juselius and Steffen L Lauritzen. 1981. “Statistical analysis of time series: Some recent developments [with discussion and reply].” Scandinavian Journal of Statistics pp. 93–115.
 Dempster, Laird and Rubin (1977) Dempster, A. P., N. M. Laird and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 39(1):1–38.
 Franzese, Hays and Cook (2016) Franzese, Robert J, Jude C Hays and Scott J Cook. 2016. “Spatialand spatiotemporalautoregressive probit models of interdependent binary outcomes.” Political Science Research and Methods 4(1):151–173.
 Glaser (2017) Glaser, Stephanie. 2017. “A review of spatial econometric models for count data.” Unpublished Working Paper, University of Hohenheim .
 Hays and Franzese (2009) Hays, Jude C and Robert J Franzese. 2009. “A comparison of the smallsample properties of several estimators for spatiallag count models.” Unpublished Working Paper, University of Illinois ChampaignUrbana .
 Kalyvas (2006) Kalyvas, Stathis N. 2006. The Logic of Violence in Civil War. Cambridge University Press.
 Klier and McMillen (2008) Klier, Thomas and Daniel P McMillen. 2008. “Clustering of auto supplier plants in the United States: generalized method of moments spatial logit for large samples.” Journal of Business & Economic Statistics 26(4):460–471.
 Lambert, Brown and Florax (2010) Lambert, Dayton M, Jason P Brown and Raymond JGM Florax. 2010. “A twostep estimator for a spatial lag model of counts: Theory, small sample performance and an application.” Regional Science and Urban Economics 40(4):241–252.
 LeSage (2000) LeSage, James P. 2000. “Bayesian estimation of limited dependent variable spatial autoregressive models.” Geographical Analysis 32(1):19–35.
 LeSage and Pace (2009) LeSage, James and Robert Kelley Pace. 2009. Introduction to spatial econometrics. Chapman and Hall/CRC.
 Liesenfeld, Richard and Vogler (2016) Liesenfeld, Roman, JeanFrançois Richard and Jan Vogler. 2016. Likelihood evaluation of highdimensional spatial latent Gaussian models with nonGaussian response variables. In Spatial Econometrics: Qualitative and Limited Dependent Variables. Somerville: Emerald Group Publishing Limited pp. 35–77.
 Liesenfeld, Richard and Vogler (2017) Liesenfeld, Roman, JeanFrançois Richard and Jan Vogler. 2017. “LikelihoodBased Inference and Prediction in SpatioTemporal Panel Count Models for Urban Crimes.” Journal of Applied Econometrics 32(3):600–620.
 Louis (1982) Louis, Thomas A. 1982. “Finding the observed information matrix when using the EM algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) pp. 226–233.
 McMillen (1992) McMillen, Daniel P. 1992. “Probit with spatial autocorrelation.” Journal of Regional Science 32(3):335–348.
 Natarajan, McCulloch and Kiefer (2000) Natarajan, Ranjini, Charles E McCulloch and Nicholas M Kiefer. 2000. “A Monte Carlo EM method for estimating multinomial probit models.” Computational Statistics & Data Analysis 34(1):33–50.
 Ord (1975) Ord, Keith. 1975. “Estimation methods for models of spatial interaction.” Journal of the American Statistical Association 70(349):120–126.
 Pace and LeSage (2016) Pace, R Kelley and James P LeSage. 2016. Fast simulated maximum likelihood estimation of the spatial probit model capable of handling large samples. In Spatial Econometrics: Qualitative and Limited Dependent Variables. Somerville: Emerald Group Publishing Limited pp. 3–34.
 Pace and Barry (1997) Pace, R Kelley and Ronald Barry. 1997. “Sparse spatial autoregressions.” Statistics & Probability Letters 33(3):291–297.
 Pierskalla and Hollenbach (2013) Pierskalla, Jan H and Florian M Hollenbach. 2013. “Technology and collective action: The effect of cell phone coverage on political violence in Africa.” American Political Science Review 107(2):207–224.
 Pinkse and Slade (1998) Pinkse, Joris and Margaret E Slade. 1998. “Contracting in space: An application of spatial statistics to discretechoice models.” Journal of Econometrics 85(1):125–154.
 Raleigh et al. (2010) Raleigh, Clionadh, Andrew Linke, Håvard Hegre and Joakim Karlsen. 2010. “Introducing ACLED: an armed conflict location and event dataset: special data feature.” Journal of Peace Research 47(5):651–660.
 Richard and Zhang (2007) Richard, JeanFrancois and Wei Zhang. 2007. “Efficient highdimensional importance sampling.” Journal of Econometrics 141(2):1385–1411.
 Schrodt (2012) Schrodt, Philip A. 2012. “Precedents, progress, and prospects in political event data.” International Interactions 38(4):546–569.
 Smirnov and Anselin (2009) Smirnov, Oleg A and Luc E Anselin. 2009. “An O (N) parallel method of computing the LogJacobian of the variable transformation for models with spatial interaction on a lattice.” Computational Statistics & Data Analysis 53(8):2980–2988.
 Tollefsen and Buhaug (2015) Tollefsen, Andreas Forø and Halvard Buhaug. 2015. “Insurgency and Inaccessibility1.” International Studies Review 17(1):6–25.
 Tollefsen, Strand and Buhaug (2012) Tollefsen, Andreas Forø, Håvard Strand and Halvard Buhaug. 2012. “PRIOGRID: A unified spatial data structure.” Journal of Peace Research 49(2):363–374.
 Wang, Iglesias and Wooldridge (2013) Wang, Honglin, Emma M Iglesias and Jeffrey M Wooldridge. 2013. “Partial maximum likelihood estimation of spatial probit models.” Journal of Econometrics 172(1):77–89.
 Weidmann and Ward (2010) Weidmann, Nils B and Michael D Ward. 2010. “Predicting conflict in space and time.” Journal of Conflict Resolution 54(6):883–901.
 Wild and Gilks (1993) Wild, Pascal and WR Gilks. 1993. “Algorithm AS 287: Adaptive rejection sampling from logconcave density functions.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 42(4):701–709.
 Wilhelm and de Matos (2013) Wilhelm, Stefan and Miguel Godinho de Matos. 2013. “Estimating Spatial Probit Models in R.” R Journal 5(1).
 Ye (1988) Ye, Yinyu. 1988. Interior algorithms for linear, quadratic, and linearly constrained convex programming. PhD thesis Stanford University.