Predicting Ambulance Demand:
a SpatioTemporal Kernel Approach
Abstract
Predicting ambulance demand accurately at fine time and location scales is critical for ambulance fleet management and dynamic deployment. Largescale datasets in this setting typically exhibit complex spatiotemporal dynamics and sparsity at high resolutions. We propose a predictive method using spatiotemporal kernel density estimation (stKDE) to address these challenges, and provide spatial density predictions for ambulance demand in Toronto, Canada as it varies over hourly intervals. Specifically, we weight the spatial kernel of each historical observation by its informativeness to the current predictive task. We construct spatiotemporal weight functions to incorporate various temporal and spatial patterns in ambulance demand, including locationspecific seasonalities and shortterm serial dependence. This allows us to draw out the most helpful historical data, and exploit spatiotemporal patterns in the data for accurate and fast predictions. We further provide efficient estimation and customizable prediction procedures. stKDE is easy to use and interpret by nonspecialized personnel from the emergency medical service industry. It also has significantly higher statistical accuracy than the current industry practice, with a comparable amount of computational expense.
Predicting Ambulance Demand:
a SpatioTemporal Kernel Approach
Zhengyi Zhou 
Center for Applied Mathematics 
Cornell University 
Ithaca, NY 14853 
zz254@cornell.edu 
David S. Matteson 
Department of Statistical Science 
Cornell University 
Ithaca, NY 14853 
matteson@cornell.edu 
\@float
copyrightbox[b]
\end@floatCategories and Subject Descriptors H.2.8 [Database Management]: Database Applications—Data mining ; G.3 [Probability and Statistics]: Nonparametric statistics

kernel density estimation; nonhomogeneous Poisson point process; emergency medical service
The emergency medical service (EMS) industry aims to minimize response times to emergencies and keep operational costs low. Accurate spatiotemporal demand predictions are critical to staff / fleet management and dynamic deployment. Predictions are required at high temporal and spatial resolutions for operational decisionmaking; the industry typically predicts for every hour and every 1 km region. We are motivated to predict spatiotemporal ambulance demand in Toronto, Canada.
There are several typical challenges to predicting ambulance demand. First, ambulance demand data is often sparse at the prediction resolution. For instance, Toronto receives only 23 highest priority events per hour on average; 96% of the 1 km spatial regions in Toronto have zero event in any hour. Also, ambulance demand frequently exhibits complex temporal dynamics, some of which are locationspecific. For example, ambulance demand in Toronto has weekly and daily seasonality and shortterm serial dependence of a few hours [?, ?]. Among these patterns, daily seasonality and shortterm dependence are more pronounced downtown and in denser neighborhoods [?]. Lastly, ambulance demand data in large cities is usually largescale; Toronto dispatches for about 200,000 priority events per year. This may present computational challenges, especially since predictions are needed hourly.
The current industry practice for predicting ambulance demand often uses a simple averaging formula. Demand in a 1 km spatial region over an hour is typically predicted by averaging a small number of historical counts, from the same spatial region and over the corresponding hours from previous weeks or years [?]. In current practice, Toronto EMS averages four historical counts in the same hour of the year over the past four years, while the EMS of CharlotteMecklenburg, North Carolina averages twenty historical counts in the same hour of the preceding four weeks for the past five years (the MEDIC method) [?]. Averaging so few historical counts, which are mostly zeros, produces highly noisy and flickering predictions, resulting in haphazard and inefficient deployment. Such methods are also sensitive to how the spatial domains are partitioned [?].
Many studies have accurately predicted aggregate ambulance demand as a temporal process. These studies have considered autoregressive moving average models [?], factor models [?] and spectrum analysis [?]. However, very few studies have modeled spatiotemporal ambulance demand well. One study uses artificial neural networks (ANN) to predict ambulance demand on discretized time and space, but fails to improve predictive accuracy over typical industry practice due to data sparsity [?]. Another more recent study predicts ambulance demand in discrete time and continuous space [?]. It proposes a timevarying Gaussian mixture model (GMM) to incorporate locationspecific temporal patterns in the demand, giving higher predictive accuracy than industry approaches. However, its estimation procedure via Bayesian sampling may present computational challenges and require special expertise, making it difficult to use by nonspecialized personnel from the EMS industry.
Kernel density estimation (KDE) is a powerful tool for nonparametric density estimation in spatiotemporal data. It has been widely applied to visualize or forecast spatiotemporal crime incidence [?, ?], disease spread [?, ?], product demand [?], and data streams [?, ?]. In most cases, the time dimension is treated differently from the space dimension(s). The most traditional approach is to build a separate spatial KDE for each discrete time period. However, this approach may result in uneven subset size and sparse subsets with too little data for accurate density estimates. Recent studies assume a multiplicative orthogonal relationship between the time and space dimensions. For example, [?] multiplies a spatial kernel with a linear function in time. Studies such as [?, ?, ?] multiply a spatial kernel and a temporal kernel with different bandwidths and kernel functions for the two kernels.
Extending these multiplicative spatiotemporal KDE methods, we propose a fast and accurate method for predicting spatiotemporal ambulance demand that is practical and scalable in industrial settings. We follow [?] in predicting in discrete time and continuous space. We propose a novel specification of spatiotemporal kernel density estimation (stKDE). First, we learn parametrically the temporal and spatial characteristics of the demand. Each historical observation is annotated with a weight based on what we have learned. This spatiotemporal weight function scales how helpful different historical observations are to a given predictive task. Then we construct a spatial kernel density estimator weighted by the informativeness weight function, and use the resulting kernel density estimates as predictions. In this way, we efficiently emphasize the historical data most important to prediction and, as far as possible, exploit the spatial and temporal characteristics in the data. This approach has three main advantages

accessibility: stKDE is fully automated and robust. It is easy to interpret and use by nonspecialized personnel, while approaches such as ANN or GMM may need special expertise (e.g., tuning, MCMC diagnostics).

efficiency: stKDE has low computational complexity. It is faster than GMM; inferring latent component label in GMM is costly. It can afford more frequent parameter estimation updates and online predictions.

accuracy: stKDE gives more accurate predictions than current industry practice with similar computational expense. It also outperforms naive KDE methods (and ANN via [?]); it is at least as accurate as GMM.
We implement this method on ambulance demand data from Toronto, Canada from years 2007 and 2008. The data consist of priority emergency events received by Toronto EMS for which an ambulance was dispatched. Each record contains the time and the location to which the ambulance was dispatched. This includes some calls not requiring lightsandsirens response, but does not include scheduled patient transfers. We include only the first event in our analysis when multiple responses are received for the same event; explanatory analysis did not reveal any spatial or temporal pattern for these cases, and we treat them as a single ambulance dispatch. We have removed all calls with no times or locations. There was no call received for more than two hours on March 10, 2007 due to a recording system malfunction, and we have also removed all calls from that day. These removals totaled less than 2% of the data. For example, Figure 1 shows the locations of all ambulance demand from January to July 2008.
Our proposed method, stKDE, gives significantly more accurate predictions than current industry practice with similar computational expense. It also compares favorably to the stateoftheart in ambulance demand prediction research.
We propose the stKDE model in §2 and discuss computational methods in §3. We show the empirical results on Toronto ambulance demand in Section §4, and conclude in §5.
We model Toronto’s ambulance demand on a continuous spatial domain and a discretized temporal domain of onehour intervals . Let be the location of the th ambulance demand arising from the th time period, for , where is the total number of ambulances demanded in the th period. Since a nonhomogeneous Poisson process (NHPP) is a natural model for a spatial point process [?, ?], we assume for each time period independently follow an NHPP over with positive intensity function . We decompose intensity function as , for . Here, is the aggregate demand intensity over the spatial domain, and is the continuous spatial density of the demand at time such that and . Hence, for each , and for . As mentioned before, numerous studies have proposed sophisticated and accurate methods for estimating . We thus focus on predicting the spatiotemporal demand density , which has received far less attention in the literature. With the predicted , we can furthermore predict spatiotemporal demand volume by multiplying our with from studies such as [?].
Suppose we observe and utilize historical ambulance demand from a set of past time periods , and we would like to forecast demand for a future time period . We propose to predict using a spatiotemporal weighted kernel density estimator. We place a bivariate spatial kernel at the location of each past observation in , and weight each kernel by the informativeness of the corresponding observation in predicting for the th time period. Specifically, we have for
(1) Here, is the informativeness weight function multiplied by the spatial kernel of the past observation . This weight function is defined in detail in §2.2. is the chosen bivariate spatial kernel with bivariate bandwidth , i.e.,
The weight function aims to capture the utility of a past observation in predicting demand at a future period. Specifically, we would like to incorporate in the spatial and temporal dependencies in the demand. Domain knowledge on EMS demand densities focuses our attention on weekly and daily seasonalities and shortterm serial dependence of a few hours, which have varying strengths in different neighborhoods.
We can therefore discretize the spatial domain into large spatial cells, representing a rough division into neighborhoods. We assume that temporal dependencies within each cell remain constant in space. Let denote the weight function local to each discretized cell . Within this cell, we further assume that the informativeness of a past observation from time in predicting for future time only depends on how far back is from . We use the weight function to measure how positively correlated two demand densities periods apart are in each cell. We model the weight function as
(2) for . We combine all to have
(3) Here, and for are all constrained to take values in . We use a separate parameter to capture each typical EMS patterns for easy interpretation and comparisons across locations (e.g., downtown vs suburbs) and times (e.g., winter vs summer). The term describes any potential shortterm serial dependence. Its parametric form is the same as a stationary firstorder autoregressive model, AR(1), and is also equivalent to the squared exponential function often used in Gaussian processes [?]. The term with describes any potential daily seasonality with , whereas the term with describes any potential weekly seasonality with . The parametric form of these two terms corresponds to the periodic function used in Gaussian processes [?]. These two seasonality terms are multiplied, and further discounted by a squared exponential function, . Finally, we sum the shortterm dependency effect and the seasonality effects. The different terms are combined in similar to the typical approach to combining covariance functions in Gaussian processes. . There may be other weight functions that work similarly; we draw inspirations from Gaussian processes because these functions are wellstudied and have some nice properties (e.g., infinite differentiability). This parametrization of the weight function is easy to interpret and visualize, and flexible to experiment with, even for nonexperts.
The weight function is bounded between 0 and 2. We avoid negative weights to avoid negative kernels in the kernel density estimator, which complicates bandwidth selection, results in negative density estimates that need to be floored at zero, and produces discontinuities in the derivatives of the estimates [?]. The magnitudes of the weights are nominal, as long as they are comparable across all regions, since they are normalized in Equation (Predicting Ambulance Demand: a SpatioTemporal Kernel Approach).
Equations (Predicting Ambulance Demand: a SpatioTemporal Kernel Approach), (2) and (3) together form the model.
We must select or estimate the kernel function and bivariate bandwidth in Equation (Predicting Ambulance Demand: a SpatioTemporal Kernel Approach), as well as the spatial discretization and number of parameters in the weight function (2). Since the nature of ambulance demand does not change drastically over time, these estimations may be performed infrequently in practice (at most several times a year).
For , we can use the typical Gaussian kernel, or for additional computational savings, the Epanechnikov kernel with bounded support. We can select the bandwidth via the plugin method [?] or smoothed crossvalidation [?]. We can also adopt one of many fast computational methods for KDE, including kdtrees [?], ball trees [?], dual trees [?] and statistical regular pavings [?].
For the weight function (2), we can choose the discretization mesh or a priori or via cross validation. A larger value of allows personalized temporal patterns on a finer grid, but if is too large, data may become too sparse for accurate estimation of temporal dependencies. In our application, is best chosen to be close to 20, yielding discrete regions that are roughly 5 km by 5 km each. The number of parameters in the weight function could be chosen in a number of standard ways; for instance we could use stochastic gradient ascent to maximize the joint likelihood of training data. For accurate estimation, we would need to use training data with tens of thousands of observations, and incur nontrivial computational cost. Here we introduce a much faster alternative method to estimate these parameters.
In Equation (2), measures how positively correlated two demand densities periods apart are at cell . We can directly estimate this correlation as follows. For each cell , we can approximate its demand density for any period by the proportion of observations arising from this cell out of all observations from that period. We can then obtain a time series of proportions and compute its (discrete) autocorrelation function for lag , where is the maximum lag considered. Typically can be taken to be around several weeks (hundreds of onehour periods). The nonnegative part of this autocorrelation, , or a smoothed version of it, is precisely what aims to capture. For example, Figure 2 (a) shows an example of the autocorrelation function , and the grey lines in Figure 2 (b) shows the corresponding , for (up to 4 weeks of onehour periods).
The goal is to find appropriate parameters such that best fits the shape of . To do this, we would like to minimize the sum of squared errors between and at all time lags from 1 to . We can find the optimal to for this minimization using gradient descent or grid search. The extra parameter is needed to scale to curvematch the magnitude of , and is of no real significance. Of greater importance is curvature or shape of , which is captured in to . To make comparable across all cells, we need to normalize such that the area under up to is the same across different cells.
In summary, we estimate the parameters in minimization problems: for each ,
(4) This computation is much more efficient than the joint estimation of parameters by maximizing likelihood. Here, we do not need to involve the kernel density estimator, nor loop through tens of thousands of ambulance demand observations. We can easily compute the minimization problems in parallel. For each cell, we have a lowdimensional (5 parameters) problem with a small number of observations (around hundreds of hours of time lags). A wide array of standard algorithms for solving optimization problems can be applied. For example, we can Lagrangian relax the constraint into the objective and use the genetic algorithm or particle swarm.
Once the infrequently performed parameter estimation is done, predictions for any future time period can be calculated instantaneously using short sliding windows of length . We can additionally refine or customize the prediction procedure in the following two ways.
First, to boost predictive accuracy, we can bilinearly interpolate the weight values smoothly over the spatial domain instead of taking only sets of values on a discretized grid. This is appropriate since we believe that the temporal patterns vary smoothly across the spatial domain. It also mitigates the sensitivity to predictions induced by choices of .
Secondly, we can impose an omission threshold value, , for the weights. If the weight of a past observation is below this threshold, i.e., if , we can omit this observation in the calculation of weighted KDE by overriding . The threshold can be chosen to balance the tradeoff between computational expense and predictive accuracy.
The computation has two stages. In the first stage, we estimate or choose all parameters, including the kernel , bandwidth , discretization and number of parameters. This estimation only needs to be performed infrequently. For this parameter estimation, we use Toronto ambulance data from January to July 2008. Figure 1 shows the spatial locations of all observations from this 7month period. In the second stage, we predict future ambulance demand on a sliding window of length (4 weeks, around observations) for each onehour period from August to December 2008.
In estimation, we choose the Gaussian kernel for , select the bandwidth via the plugin method [?] and discretize Toronto into equallysized regions. We estimate the parameters in the weight function using the method detailed in §3. As an example, we outline the cell covering downtown Toronto in Figure 1. We show in Figure 2 (a) the autocorrelation function for the proportions of observations arising from this downtown cell out of all observations across hourly time periods. This autocorrelation function indicates weekly, daily seasonalities and loworder serial dependence. Figure 2 (b) shows in grey and in black the fitted weight function for downtown, with (shortterm serial dependence), (daily seasonality), (weekly seasonality) and (discounting for seasonalities). The fitted weight function provides an interpretable basis for understanding exactly which historical observations are the most important for prediction. For example, from Figure 2 (b), an EMS manager can recognize that at downtown, ambulance demand in the past day or two and corresponding hour of the past few weeks are the most important. Crosscorrelations among the 21 weight functions estimated at different regions show that neighboring weight functions have some association, but those far apart are not correlated.
Once parameter estimation is done, we predict forward using a sliding window of 4 weeks for each onehour period from August to December 2008. Figure 3 shows the predictive densities on August 6, 2008 (Wednesday) at two different time periods. The ambulance demand is, not surprisingly, concentrated at the heart of downtown during day time on Wednesday (Figure 3 (b)), and more spread out throughout the city during night time on Wednesday (Figure 3 (a)). This illustrates that stKDE can differentiate temporal patterns at different time periods and locations.
We compare stKDE to the following competing methods

The MEDIC method, which is an industry practice implemented in CharlotteMecklenburg, NC (§1). We implement this method as far as we have data. The cell count in a 1km region and a 1hour period is predicted by the average of corresponding cell counts in the preceding 4 weeks in the past two years. The log predictive density produced by MEDIC for August 6, 2008 (Wednesday) at 2  3 am is shown in Figure 4. Compared to Figure 3 (a), the MEDIC prediction appears much noisier.

Two naive KDEs, (i) using data from the most recent hour to predict the next hour, and (ii) using all data from the past four weeks with equal weights (this produces a spatial only model, with almost no temporal variation).

A timevarying Gaussian mixture model. We quote results from [?] implemented on Toronto data with different training / testing months and various modeling specifications (e.g., number of components). The computational expense is considerable.
To compare the statistical predictive accuracies of our model and the comparison methods, we use the metric of average log score [?]. This performance measure is widely used because it is a strictly proper scoring rule closely related to Bayes factor and Bayes information criterion [?]. It is the average log likelihood of test data, and is defined as
in which are the test time periods, are the test data, and is the predictive density for the th period obtained either by various methods. Intuitively, it measures the probability that we observe test data given a model. Thus a less negative (higher) average log score indicates that the model is better at capturing the test data.
We show in Table 1 the predictive accuracies produced by stKDE and the comparison methods. We present three variations of stKDE prediction: (i) using the estimated discretized weight functions as they are, (ii) spatially interpolating the estimated weight values, and (iii) imposing an omission threshold on the estimated weight values such that each prediction uses a comparable amount of data as the industry method (about 200 observations).
The stKDE method significantly outperforms the MEDIC method (industry practice). It also outperforms the naive KDE methods, demonstrating the utility of incorporating spatiotemporal patterns via the weight functions. Our performance is comparable to timevarying GMM as it is implemented on Toronto data with different training / testing months and modeling specifications. Among the three variations of stKDE, allowing for bilinear interpolation of weight values improves the predictive accuracy slightly. In the third variation, including the omission threshold leads to a small loss of accuracy but reduces computational cost significantly to be comparable to the industry method.
Prediction method Accuracy stKDE + interpolation + threshold (less data) MEDIC naiveKDE most recent hour all equal weights GMM to Table \thetable: Predictive accuracies of stKDE and competing methods. Results of GMM are quoted from [?] implemented on Toronto data with various training / testing months and model specifications. The infrequent estimation of weight functions and bandwidth takes several hours on a personal computer. This offline training is significantly shorter than that of GMM (inferring latent component labels in GMM is costly). It does not take much longer than estimating bandwidths for naive KDE methods. Once estimation is done, making each new prediction is instantaneous (a few seconds). We could further reduce the computational expense of stKDE by parallelizing weight estimation, using a treebased algorithm for fast KDE computation, using a bounded kernel function, or creating a lookup table of densities (none of these was done).
Fineresolution spatiotemporal ambulance demand predictions are crucial to optimal ambulance planning. The EMS industry practice and early studies either sacrifice predictive accuracy for fast computation, or incur substantial computational cost in pursuit of high accuracy. We provide a muchneeded prediction method that is both accurate and fast. We predict spatiotemporal ambulance demand in Toronto with higher accuracy and comparable computational cost as a typical industry practice.
We propose a spatiotemporal weighted kernel density estimator. The spatial kernel of each historical observation is multiplied with a weight value to indicate the informativeness of this historical observation to the current predictive task. The spatiotemporal weight functions are inferred from dependencies in data, are unique to each neighborhood and can be updated regularly. This is an improvement from the ad hoc heuristic that only accounts for the weekly and yearly seasonality across the entire city. The weight functions are also flexible to represent various spatial and temporal characteristics. They are easy to experiment with, visualize and interpret by nonexperts. Moreover, stKDE easily handles missing data by placing zero weight and scaling up weights on other data proportionally. It can also easily predict many hours or days into the future.
The proposed method provides efficient estimation of the weight function, and offers customizable prediction to balance the tradeoffs between accuracy and computational cost. It is straightforward to implement by nonspecialized users and scalable to largescale datasets, and can be easily generalized to a wide range of other applications with spatialtemporal point process data.
The authors sincerely thank Toronto EMS for sharing their data, and support from a Xerox Faculty Research Award and NSF grant DMS1455172.
 [1] C. C. Aggarwal. A framework for diagonosing changes in evolving data streams. In ACM SIGMOD International Conference on Management of Data, pages 575 – 586, 2003.
 [2] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975.
 [3] C. Brunsdon, J. Corcoran, and G. Higgs. Visualising space and time in crime patterns: a comparison of methods. Computers, Environment and Urban Systems, 31:52 – 75, 2007.
 [4] N. Channouf, P. L’Ecuyer, A. Ingolfsson, and A. Avramidis. The application of forecasting techniques to modeling emergency medical system calls in calgary, alberta. Health Care Management Science, 10:25–45, 2007.
 [5] P. J. Diggle. Statistical Analysis of Spatial Point Patterns. Arnold, London, second edition, 2003.
 [6] T. Duong and M. L. Hazelton. Crossvalidation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32:485–506, 2005.
 [7] T. Gneiting and A. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102:359–378, 2007.
 [8] J. B. Goldberg. Operations research methods for the deployment of emergency service vehicles. EMS Management Journal, 1:20–39, 2004.
 [9] I. J. Good. Rational decisions. Journal of the Royal Statistical Society: Series B, 14:107–114, 1952.
 [10] A. G. Gray and A. W. Moore. Nonparametric density estimation: toward computational tractability. In Proceedings of the SIAM International Conference on Data Mining, 2003.
 [11] E. M. Jansenberger and P. StauferSteinnocher. Dual kernel density estimation as a method for describing spatiotemporal changes in the upper Austrian food retailing market. In AGILE Conference on Geographic Information Science, 2004.
 [12] D. S. Matteson, M. W. McLean, D. B. Woodard, and S. G. Henderson. Forecasting emergency medical service call arrival rates. Annals of Applied Statistics, 5:1379–1406, 2011.
 [13] J. Møller and R. P. Waagepetersen. Statistical Inference and Simulation for Spatial Point Processes. Chapman & Hall/CRC, London, 2004.
 [14] T. Nakaya and K. Yano. Visualising crime clusters in a spacetime cube: an exploratory data analysis approach using spacetime kernel density estimation and scan statistics. Transactions in GIS, 14:223 – 239, 2010.
 [15] S. M. Omohundro. Five balltree construction algorithms. International Computer Science Institute Technical Report, 1989.
 [16] C. M. Procopiuc and O. Procopiuc. Density estimation for spatial data streams. Advances in Spatial and Temporal Databases, 3633:109 – 126, 2005.
 [17] C. E. Rasmussen and C. K. I. Williams. Gaussian Process for Machine Learning. MIT, Boston, 2006.
 [18] R. Sainudiin, G. Teng, J. Harlow, and D. Lee. Posterior expectation of regularly paved random histograms. ACM Transactions on Modeling and Computer Simulation, 23, 2013.
 [19] D. W. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley and Sons, New York, 1992.
 [20] H. Setzler, C. Saydam, and S. Park. EMS call volume predictions: A comparative study. Computers & Operations Research, 36:1843–1851, 2009.
 [21] J. L. Vile, J. W. Gillard, P. R. Harper, and V. A. Knight. Predicting ambulance demand using singular spectrum analysis. Journal of the Operations Research Society, 63:1556–1565, 2012.
 [22] M. P. Wand and M. C. Jones. Multivariate plugin bandwidth selection. Computational Statistics, 9:97–116, 1994.
 [23] J. W. Wilesmith, M. A. Stevenson, C. B. King, and R. S. Morris. Spatiotemporal epidemiology of footandmouth disease in two counties of great britain in 2001. Preventive Veterinary Medicine, 61:157 – 170, 2003.
 [24] Z. Zhang, D. Chen, W. Liu, J. S. Racine, S. H. Ong, Y. Cheng, G. Zhao, and Q. Jiang. Nonparametric evaluation of dynamic disease risk: a spatiotemporal kernel approach. PLoS ONE, 6, 2011.
 [25] Z. Zhou, D. Matteson, D. Woodard, S. Henderson, and A. Micheas. A spatiotemporal point process model for ambulance demand. Journal of the American Statistical Association, 110:6–15, 2015.
