Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ “Real-Time Crime Forecasting Challenge”

# Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ “Real-Time Crime Forecasting Challenge”

\fnmsSeth \snmFlaxman\thanksreft1,m1label=e1]s.flaxman@imperial.ac.uk [    \fnmsMichael \snmChirico\thanksrefm2label=e2] [    \fnmsPau \snmPereira\thanksrefm3label=e3] [    \fnmsCharles \snmLoeffler\thanksrefm4label=e4] [ Department of Mathematics and Data Science Institute,
Imperial College London\thanksmarkm1, Grab, Singapore\thanksmarkm2, Amazon, Inc.\thanksmarkm3 and Department of Criminology, University of Pennsylvania\thanksmarkm4
###### Abstract

We propose a generic spatiotemporal event forecasting method, which we developed for the National Institute of Justice’s (NIJ) Real-Time Crime Forecasting Challenge (NIJ, 2017). Our solution to the challenge is a spatiotemporal forecasting model combining scalable randomized Reproducing Kernel Hilbert Space (RKHS) methods for approximating Gaussian processes with autoregressive smoothing kernels in a regularized supervised learning framework. While the smoothing kernels capture the two main approaches in current use in the field of crime forecasting, heatmap-based (kernel density estimation) and self-exciting point process (SEPP) models, the RKHS component of the model can be understood as an approximation to the popular log-Gaussian Cox Process model. For inference, we discretize the spatiotemporal point pattern and learn a log intensity function using the Poisson likelihood and highly efficient gradient-based optimization methods. Model hyperparameters including quality of RKHS approximation, spatial and temporal kernel lengthscales, number of autoregressive lags, bandwidths for smoothing kernels, as well as cell shape, size, and rotation, were learned using temporal crossvalidation. Resulting predictions significantly exceeded baseline KDE estimates; we had winning submissions to the competition in each of the different crime types and across different timescales, suggesting that our method is generically applicable.

\kwd
\arxiv

arXiv:1801.02858 \startlocaldefs \endlocaldefs

\runtitle

Forecasting spatiotemporal events with kernel methods {aug} , ,
and

\thankstext

t1Support was provided by the EPSRC (EP/K009362/1)

spatial statistics \kwdtime series \kwdsupervised learning \kwdspatiotemporal forecasting \kwdCox process \kwdRKHS

## 1 Introduction

Spatiotemporal forecasting of crime has been the focus of considerable attention in recent years as academic researchers, municipal police departments, and commercial entities have all sought to build forecasting tools that use police data to predict when and where both common and rare crimes are most likely to occur (Perry et al., 2013). The earliest crime forecasting tools consisted of nothing more than pin-maps (See, for example, Fig. 1). Prior week’s crimes were mapped and qualitative assessments of density, location, stability and significance were made (Schutt, 1922; Bratton, 1998).

Subsequent tools have adopted a range of different smoothing techniques to augment this method (Levine, 2004; Gorr, Olligschlaeger and Thompson, 2003; Gorr and Lee, 2015; Chainey, Tompson and Uhlig, 2008a; Johnson et al., 2009) with kernel density estimation the most commonly used approach (Gorr and Lee, 2015; Porter and Reich, 2012; Boni and Gerber, 2016). Many are model-driven, based on theories of crime causation (Caplan, Kennedy and Miller, 2011; Mohler et al., 2011). Some use log-Gaussian Cox Processes (LGCPs) (Rodrigues and Diggle, 2012; Shirota et al., 2017), while others use self-exciting point process models (SEPPs) (Levine, 2004; Liu and Brown, 2003; Taddy, 2010; Mohler et al., 2011; Rosser and Cheng, 2016) based on evidence of elevated levels of near-repeat victimization (Pease et al., 1998), particularly for burglary. Mohler et al. (2013) combines both LGCPs and SEPPs. Some use additional information, such as weather, demographics, and even social media (Wang, Gerber and Brown, 2012). Most rely on nothing more than past events to forecast future events (Chainey, Tompson and Uhlig, 2008b; Kang and Kang, 2017), suggesting that spatiotemporal methods that are effective at forecasting crime could readily be generalized to an increasing number of real-time spatiotemporal forecasting problems (Taddy, 2010). However, both designers and end-users of these many approaches to forecasting crime often confront the question of which of these approaches to adopt.

In 2016 the National Institute of Justice (NIJ) announced the NIJ Real-Time Crime Forecasting Competition to test which crime forecasting models could most accurately predict out-of-sample crime hotspots in the City of Portland during the Spring of 2017. This solicitation drew in a wide range of competitors including students, research labs, and companies. Teams were given five years of historical calls-for-service data from the Portland Police Bureau (PPB) and asked to submit predictions for the locations of the largest crime clusters in the subsequent weeks and months. Separate maps were requested for different forecasting windows and forecasted crime types.

Our team (“Team Kernel Glitches”) tied for first place in the large organization category with wins across a range of categories. While our solution to the competition performed equally well on frequent and sparse crime forecasts and over short and long durations, it performed especially well, compared to competitors and contemporary methods, at forecasting sparse events over short durations (e.g., burglaries). We attribute this success to our strategy of training as many model parameters as possible through supervised learning. Like the larger spatiotemporal forecasting literature, the models most commonly used in crime forecasting are often implemented outside of a supervised learning framework. This has meant that for any given model, features, weighting, bandwidths, and virtually every other model parameter has been set based on trial-and-error or theory rather than through a systematic search of optimal parameters for forecasting accuracy. For high volume events, this optimization may not be particularly necessary, but this is less likely to be the case for low volume events or high volume events over very short time periods.

In describing our solution to the NIJ competition, we make the following contributions: we propose a flexible, generic, and scalable spatiotemporal forecasting model, casting the problem of spatiotemporal forecasting explicitly as a supervised learning problem, while incorporating existing and highly successful modeling approaches from the spatiotemporal statistics literature: Gaussian processes, autoregressive terms, kernel smoothing, and self-exciting point processes. Our supervised learning setup gives us a coherent framework for the time-consuming task of optimizing hyperparameters, while our scalable approach to modeling and inference ensures that the model parameters themselves can be learned quickly enough to enable real-time forecasting. Our approach achieves accuracy improvements well beyond those generated by existing best-practices in crime prediction (Chainey, Tompson and Uhlig, 2008b; Johnson et al., 2009).

The rest of this paper is laid out as follows. Section 2 describes the details of the NIJ forecasting competition. Section 3 describes our model. Section 4 reports our competition performance. Section 5 concludes with a discussion of implications for future work on spatiotemporal prediction of crime and related phenomenon.

## 2 The competition

The goal of the NIJ Real-Time Crime Forecasting Competition was to accurately forecast hotspots for several categories of calls-for-service to the Portland Police Bureau (PPB) in Portland, Oregon. Contestants submitted forecasts on (or before) February 28, 2017 for various time horizons starting on March 1, 2017 and extending as late as May 31, 2017. The hotspot predictions were scored based on two metrics related to their accuracy. Contest rules required that contestants predict which of the 62,500 - 360,000 square foot cells111The extremes correspond to 250 ft by 250 ft and 600 ft by 600 ft squares, respectively; NIJ also introduced a minimum dimension of 125 feet (NIJ, 2017). within PPB’s 147.71 square mile service area would have the highest density of calls-for-service, with the total forecast area being no smaller than 0.25 square miles and no larger than 0.75 square miles, equivalent to forecasting 175–525 city blocks out of a total of 103,397 blocks222Portland city blocks are small compared to other American cities at 200 feet on a side; see https://www.strongtowns.org/journal/2013/11/27/optimizing-the-street-grid.html. Prizes were given out for five different cumulative forecast periods (1 week, 2 weeks, 4 weeks, 8 weeks, 12 weeks), four different crime categories (Burglary, Street Crimes, Theft of Autos, All Calls for Service), and two different accuracy metrics.

### 2.1 Data and Setting

The NIJ Real-Time Crime Forecasting dataset consists of 958,499 calls-for-service records from the Portland Police Bureau (PPB). These spatiotemporal records represent calls to Portland’s 911 system requesting police assistance from March 1st, 2012 through February 28th, 2017.333In practice, the final day of pre-competition data was not posted via NIJ’s website until after the submission deadline. Via web-scraping of twitter posts of PPB calls for service, a portion of this final day’s calls were incorporated into forecasts. On average, there were 2.6 burglary calls per day, 6.8 motor vehicle theft calls per day, 93 street crime calls per day, and 470 other calls per day for a total of 572 calls for service per day. As shown in Fig. 2, the four categories of crime, which themselves varied in the degree of internal heterogeneity, included Burglary (Burglary and Prowling), Street Crime (ranging from Disturbance and Threats up to Armed Robbery and Assault with a Firearm), Theft of Auto, and all calls for service. With a handful of exceptions444The following calls-for-service were excluded (primarily violent crimes): administrative, arson, crisis, death, domestic disturbance/violence, juvenile offenses, kidnapping, K9 explosive sweep, missing person, rape, restraining order, sex offense, stalking, and suicide. In addition, a handful of locations were also excluded (all hospitals and police precincts): 737 SE 106th Ave (East Precinct), 449 NE Emerson St (North Precinct), 2801 N Gantenbein Ave (Legacy Emanuel Hospital), 4804 NE Glisan St (Providence Hospital), 1111 SW 2nd Ave (Central Precinct), 10300 SE Main St (Adventist Medical Center), 1014 NW 22nd Ave (Legacy Good Samaritan Medical Center), 41 NE Grand Ave (Detox Center), 3181 SW Sam Jackson Park Rd (OHSU Hospital), 10123 SE Market St (Adventist Medical Center), 12240 NE Glisan (Multnomah County Sheriffâs Office), and 3303 SW Bond Ave (OHSU Hospital), this covered all calls for service routed to the Police Bureau during this period.

### 2.2 Metrics

Most crime forecasting tools begin by placing a grid over a region of interest and comparing the number of crimes observed in the forecasted “hotspot’ cells compared to the total number of crimes in the region as a whole. The simplest metric of this type is the “hit rate” (Chainey, Tompson and Uhlig, 2008a)555See Adepeju, Rosser and Cheng (2016) for a recent discussion of alternative evaluation metrics for crime forecasting.:

 Hit rate=nN

where is the number of crimes predicted and is the total number of crimes in that period in that area. The hit rate, also known as sensitivity in the statistics literature, is simply the fraction of crimes predicted for a given coverage area. Performance on this metric depends critically on the size of the forecasted area in addition to underlying crime densities and forecasting quality. For example, a coverage area including an entire city would trivially produce a perfect hit rate. However, as the coverage area shrinks both the difficulty of the task and value of resulting predictions increases. In the case of the NIJ competition, this coverage area was between 0.2% and 0.5% of the City of Portland, the region between the red vertical lines in Fig. 3.

The NIJ competition focused on two alternatives metrics (Chainey, Tompson and Uhlig, 2008a; Hunt, 2016), with a goal of allowing for a comparison of hit rates across forecasts using different coverage areas, something that is not possible if hit rates are used on their own. The first metric, the prediction accuracy index (PAI) (Chainey, Tompson and Uhlig, 2008a), is the ratio of the hit rate to the fraction of area covered:

 PAI=nNaA.

This metric directly incorporates the trade-off visible in Fig. 3 into the score weighting.666Note, however that the weighting of this tradeoff implicitly weights as equal the following: a multiplicative increase in the sensitivity and a multiplicative decrease in the forecasted area. This can be seen with a simple example: consider a sensitivity (hit rate) of 10% for a forecasted area covering 0.4% of the city. The PAI is 25. If we doubled the sensitivity to 20% or halved the forecasted area to 0.2% of the city PAI would equal 50.

The second metric, the prediction efficiency index (PEI), is the ratio of PAI to the hypothetically maximum PAI that could have been obtained using the chosen coverage area and the submitted grid. Since the forecasting area is the same in both the actual and hypothetical maximum cases, this reduces to:

 PEI=nn∗

where is the number of crimes occurring in predicted hotspots, and is the maximum number of crimes that could have been captured for the forecasted area.

After conducting exploratory data analysis, we noted that for sparse crimes at some horizons, , the totality of crimes in that category. For example, in all of the training periods’ one-week horizons (1st Week of March 2014, 1st Week of March 2015, etc.), there were only a small number of burglaries or motor vehicle thefts (15 and 43 respectively each week). This is because, for such a small number of crimes, even a tiny forecasted area (e.g., the minimum of 62,500 sq ft), ideally, would be able to cover every single incident. By contrast, for crimes which see hundreds or thousands of occurrences in a given horizon (street crimes in Portland occur at a rate of about 650 per week, all-calls-for-service are about 3690 per week), no matter how the hotspots are arranged, some crimes will not be in the forecasted areas since there is a maximum forecasted area of = 360,000 sq ft. In practice, this means that optimal strategy to maximize is to try and cover every crime that will occur in the designated hotspots.

Therefore, for relevant crime and horizons, we eliminated the forecasted area as a hyperparameter, since (in expectation) any forecast for an area less than could be improved by simply forecasting more hotspots. This optimization, however, comes with a trade-off. Optimizing for PEI in this manner incurs a PAI penalty proportional to the marginal change in forecasted area divided by the marginal change in correctly forecasted crimes. As the expectation of correctly predicting crime(s) in the marginal cell shrinks geometrically, the denominator continues to grow linearly. Therefore, the optimal cell selection for maximizing PEI will often fail to maximize PAI and vice versa. We therefore somewhat arbitrarily decided to maximize PEI rather than PAI. [Note: for a result making the opposite choice, see Mohler and Porter (2017).] Prizes were awarded for each metric separately.

## 3 Our model

Previous work has demonstrated that Gaussian process modeling of crime data can produce highly accurate long-term forecasts by combining the benefits of nonparametric methods with the interpretability of additive methods (Flaxman, 2014). Subsequent work (Flaxman et al., 2015) has proposed that instead of specifying a particular additive kernel structure, it is is possible to learn it directly from the data, given enough data and a rich enough class of kernels. This assumes, however, that it is possible to perform inference with very large datasets, as the standard approach to Gaussian process inference requires matrix algebra to manipulate the multivariate Gaussian distribution in Eq. (A.4), requiring time and storage. We therefore first present the hypothetical model we would use if computational constraints were not a concern, then we present our actual model, which is an approximation to this model.

Our hypothetical model is a log-Gaussian Cox Process (for an overview see Section A.1), meaning that it is a doubly stochastic model which we parameterize hierarchically as follows:

 f ∼GP(μ,kθ(⋅,⋅)) (3.1) N(S)|f ∼Poisson(∫Sexp(f(s))ds) (3.2)

is a counting measure. For any , is a Poisson distributed random variable giving the number of points in space-time volume . is real-valued, so to use it as the rate parameter in a Poisson distribution, we need to transform it to be non-negative. The choice of an exponential link function means that we are placing a Gaussian process prior on the log-intensity, thus the name log-Gaussian Cox Process. To complete the specification, we would need to parameterize and , where gives the mean of the underlying log-intensity and parameterizes its covariance.

Inference with this model is difficult because it is doubly intractable and existing approaches (Møller, Syversveen and Waagepetersen, 1998; Brix and Diggle, 2001; Cunningham, Shenoy and Sahani, 2008; Adams, Murray and MacKay, 2009; Teh and Rao, 2011; Diggle et al., 2013) are often limited to one dimension and small datasets. Lloyd et al. (2015) is a possible exception in that it points the way to a scalable stochastic variational inference approach.

Now we turn to our actual model, an approximation to the model above. Our first approximation requires the specification of a space-time grid covering . We assume that the grid has cells, each indexed by its centroid given as a latitude/longitude/timestamp triple . Now instead of a point pattern we consider our dataset as aggregate counts of the number of crimes per grid cell. Given a uniform grid, we can now approximate the integral in Eq. (3.2) with a sum:

 ∫Sexp(f(s))ds≈|S|NN∑i=1exp(f(si)) (3.3)

In a Poisson process, conditional on the intensity, the random variables and are independent for . Thus given , we can consider each grid cell independently, so Eq. (3.2) becomes:

 oi|f∼Poisson(exp(f(si)|Si|)),  ∀i (3.4)

We now have an iid likelihood (observation model) over all cells , and we arrive at the so-called computational grid approximation to the log-Gaussian Cox Process (Diggle et al., 2013; Flaxman et al., 2015).

is defined at any space-time coordinate using a space-time kernel . Naïvely, computations involving observed at cells would require storing and manipulating an covariance matrix at a cost of storage and computation, which is infeasible for large . We instead apply the random Fourier feature expansion of our kernel (Rahimi and Recht, 2007) as described in Section A.2. This involves selecting a dimension which determines the accuracy of our approximation and forming a new design matrix which is dimensional, where .

Viewing as an explicit feature expansion for our kernel, we can now use the machinery of generalized linear modeling (GLM): we have a regression problem where we want to learn coefficients for covariates given by a particular design matrix . Thus we have the following GLM with Poisson likelihood and exponential link function:

 oi|β,μ,Φ∼Poisson(exp(μ(xi,yi,ti)+Φiβ)) (3.5)

It remains to specify the function . One choice would be to simply use , but we know from previous work that using historical rates can be very effective in crime forecasting. We use a more flexible model, in line with our overall supervised machine learning approach and with previous work on crime forecasting using heatmaps.

 μ(x,y,t)=p∑i=1γiKDEλ,i(x,y,t) (3.6)

where we have lagged terms, each representing a spatial heatmap for a given time period in the past and we wish to learn regression coefficients . is the kernel density estimator at location using a Gaussian kernel which ignores time with spatial lengthscale :

 KDEλ,i(x,y,t)=∑{tj | t−i⋅D

where is the size of the temporal window in days.

Combined with a Poisson likelihood and with observations , we have specified a problem which is amenable to solving by maximum likelihood. Since we have the potential for a very large number of parameters (the more random frequencies we choose for the random Fourier feature expansion, the better our approximation) we consider the use of and regularization, as in the popular elastic net (Zou and Hastie, 2005).777While in practice it is possible that neither is ultimately necessary, especially since we crossvalidate to determine the number of random frequencies, the use of regularization with random Fourier features corresponds to kernel ridge regression, while the use of regression is meant to give the model the ability to zero out the the KDE features or particularly bad choices of frequencies. The parameters of the model are and , and we learn these by maximizing the penalized likelihood of our model using gradient descent, as implemented in the large-scale machine learning package Vowpal Wabbit. In particular, we fit the training dataset described in Table 1 using the Poisson likelihood as loss function, default settings for the learning algorithm (a variant of online gradient descent) and default settings for the learning rates, with at most 200 training passes (epochs) through the dataset. After fitting the model, we make predictions in the form of counts for the test data from Table 1, and then calculate PEI for each year of data.

To learn the hyperparameters we maximize average PEI. The hyperparameters related to our model are as follows: the number of random features in our feature expansion999We use the same random seed when generating these, so that the random features for e.g.  are a subset of those generated for ., the number of lags , the size of the temporal window , the spatial lengthscale for KDE (with a Gaussian kernel), the lengthscale of the covariance kernel (we used a Matérn-5/2 kernel), and the amount of and regularization. In addition, there are hyperparameters related to the data and forecasts: cell size, shape, grid rotation, and forecast area, which we also learn by crossvalidation. We crossvalidated over a very large grid of hyperparameters, supplemented with hyperparameter search using Bayesian Optimization (O’Hagan, 1992; Snoek, Larochelle and Adams, 2012; Hennig, Osborne and Girolami, 2015). The details of the hyperparameters we selected are given in the Appendix in Section B.

### 3.1 Data for training and hyperparameter selection

To ensure that our training datasets would be as close to the real task as possible, we adopted the following approach, which we show for the 1 week horizon forecasts, for simplicity, in Table 1. For a given spatial grid size, we restricted our temporal grids to match the forecasting window. The resulting data is therefore aggregated to the weekly level for one week forecasts, the monthly level for the monthly forecasts, and so on.

We then created a single dataset using data from the union of all of the training periods.101010In the competition, to produce a more robust model, all years of data were weighted equally. However, inverse weighting by temporal distance could produce additional benefits, especially for less stochastic event patterns. Similarly, in competition, the most recent and incomplete dataset for late 2016 and early 2017 was excluded from training the hyperparameters. We train the parameters of our model using this dataset. Then we produce forecasted hotspot maps for the five different validation periods in Table 1 and calculate PEI and PAI for each. We then report average heldout PEI and PAI, and it is these metrics that we maximize (in particular PEI) to select the hyperparameters of the model.

All of our computation was carried out on in a parallel cluster computing environment, with 8 Dell PowerEdge R630 nodes consisting of 2 Intel Xeon E5-2690 v4 2.6 GHz, 14 Core CPUs, and 256 GB memory.

### 3.2 Relationship with other work

The intuition behind our approach is that we combine state-of-the-art nonparametric spatiotemporal methods (Gaussian process regression) with the most long-standing and widely used crime forecasting method (KDE heatmaps) by defining sets of features for each. We then place these two sets of features into a supervised learning framework for forecasting the intensity, and consider a large set of hyperparameters and training data, with the hope that we will be able to obtain good predictive performance on unseen data.

In this section, we detail the connections between our kernel density estimation (KDE) features and the Hawkes process, which has recently received significant attention in the crime forecasting literature (Ogata, 1988; Møller and Rasmussen, 2005; Mohler et al., 2011, 2013; Mohler, 2014; Rosser and Cheng, 2016; Loeffler and Flaxman, 2017).

The conditional intensity function used in a spatiotemporal Hawkes process takes the following form:

 λ(x,y,t)=λ0(x,y,t)+∑{j|tj

where the first term is an underlying (endogeneous) intensity and the second term is the self-excitatory component and we assume that we have the same dataset as in the previous section. An illustration is shown in Figure 4 in black for a temporal process with exponential kernel : whenever an event occurs, the intensity increases, followed by exponential decay. This is in contrast to the kernel-smoothed intensity shown in blue for a temporal process with the intensity fit with KDE (actually, as we are estimating an intensity, not a density, this is technically known as kernel intensity estimation.) KDE is a nonparametric smoothing method which does not take the directionality of time into account, so the intensity rises before and after an event occurs. This seems like a critical difference, and the shape of the intensities in Figure 4 are clearly different. But as we see below, our use of KDE features in a forecasting framework corresponds exactly to what we could obtain using an analogous set of “Hawkes features” (the terminology comes from Mohler and Porter (2017)’s submission to the competition). As defined in Eq. 3.7, is the lag-1 spatial kernel density estimator at location using data with time labels , i.e.:

 KDEλ,1(x,y,t)=∑{j|tj∈[t−d,t]}ks((xj,yj),(x,y)) (3.9)

If we consider Eq. (3.8) and the special case in which

 kt(tj,t)={1 if tj∈[t−d,t]0 otherwise (3.10)

we see that the KDE feature is equivalent to the self-excitatory term in the Hawkes process conditional likelihood. How do we thus understand this equivalence in light of Figure 4? Within our forecasting framework, the past intensity is a feature that is used to forecast the future number of crimes. Since our KDE estimate of the intensity cannot take future data into account (from the point of view of the features, the future has not occurred yet), we recover the directionality of time which was explicit in the Hawkes approach.

The kernel in Eq. (3.10) is rather simplistic, suggesting an obvious extension to our method of including a more interesting temporal kernel in the KDE features of Eq. (3.9) and possibly also for other lags as in Eq. (3.7).

## 4 NIJ Challenge Results

In this section we describe the performance of our method according to the scoring metrics of the NIJ challenge, assess its robustness on a larger class of similar tasks, and investigate what features of the model contributed to its out-of-sample performance. Source code to reproduce our results is available: https://github.com/MichaelChirico/portland.

There were a total of 20 categories (4 crime types and 5 forecasting windows) with 40 prizes awarded, one for each of the highest PEI and PAI scores in each category. Our team won a total of 9 categories in the “Large Business” competition. As we focused on maximizing the forecasting performance on the out-of-sample PEI metric, most of our winning entries were in this category: all crimes (1 week, 1 month, 3 months), burglary (1 week, 2 weeks), street (2 weeks), and vehicle (1 week). We also had winning PAI entries for burglary (1 week and 2 weeks).

Based on these overall results we first consider the two metrics, PEI and PAI. PEI and PAI clearly measure different things, because maximizing PEI did not usually lead to maximizing PAI or vice versa. However, they are related: the Pearson’s correlation between the two metrics by crime type and period ranged from 0.85 to 0.99 with particularly high correlations for all crimes and street crimes, the higher frequency crime types. Indeed, as the forecasting period lengthened the correlation increased as well, from 0.55 for all 1 week forecasts to 0.95 for all 3 month forecasts. Note that this does not explain our simultaneous winning entries for burglary in the PAI and PEI categories at 1 week and 2 weeks, an issue we consider more detail below.

Next we investigate the hyperparameters that were selected by our hyperparameter search strategy (grid search and Bayesian Optimization side-by-side). As shown in the Appendix in Table B, there were no consistently chosen hyperparameter values: the grid cells were sometimes small squares (the minimum area) or large squares (the maximum area) or large rectangles (also the maximum area). The coverage fraction ranged from the minimum (0.25 sq miles) to the maximum (0.75 sq miles). The lengthscales for space and time were all over the place, as were the number of KDE lags and the KDE bandwidth. Interestingly, the number of random Fourier features went as low as , which means that the surface was a very crude approximation to a Gaussian process consisting of the sum of random sine and cosine functions to as high as , a much better approximation. In many cases, no or regularization was needed. In a minority of cases (4 out of 20) the hyperparameters were chosen by Bayesian Optimization, while the rest were the result of the grid search. Note that the grid search considered many many more combinations of hyperparameters, as the vanilla Bayesian Optimization approach did not make use of parallel computing. While it is clear from Table B that the performance depended on optimizing the hyperparameters, reinforcing the importance of supervised learning to the optimization of crime forecasting models, the question remains how strong was this dependence, i.e. were there many reasonable choices for the hyperparameters.

As one means of assessing this question, we considered the distribution of PEI values for each crime category and forecasting window and used this distribution to calculate the z-score of each maximum. In all crimes, the z-scores ranged from 2.5 to 4.0. In street the z-scores ranged from 2.8 to 5.6. In vehicle the z-scores ranged from 7.5 to 21. In burglary the z-scores ranged from 6.2 to 12.4. In many cases, the particularly high z-scores corresponded to winning entries, e.g. our winning vehicle 1 week entry had a PEI z-score of 21, and our winning burglary entries had z-scores of 12.4 (1 week) and 11 (2 weeks). This supports the idea that there was a particularly good set of hyperparameters found by our grid search. Such high z-scores often generate concerns of overfitting, but the fact that these entries were often winning entries is reassuring in this respect. As is the fact that these high z-scores are explained in part by the relative frequency of hyperparameters which led to PEI scores of 0: for the 1 week vehicle and burglary categories, 41% and 44% (respectively) of the possible hyperparameter combinations gave PEI scores of 0.

As shown in Figure 5, which depicts the competition PEI maximums, for high volume crimes, such as all calls for service, even a week’s worth of data is sufficient to achieve very high PEI scores (nearly 0.9) of the theoretical limit (1) for a one week prediction. Extending the cumulative forecast period leads to further improvements in forecasting accuracy, plateauing at 97%. Sizable sub-categories, such as street crimes, share this basic trajectory as well. For some sparse crimes, such as motor vehicle theft, despite lower starting values, similar improvements in predictive accuracy can be seen as the forecasting windows are expanded, even if these improvements are not strictly monotonically increasing. However, for other sparse crimes, such as burglary, adding additional weeks of data to the forecast period does little to improve the forecast accuracy.

### 4.1 Investigating our performance

As discussed in Section 1, almost all crime forecasting in practice relies on heatmap-type approaches. As our model included lagged KDE terms, we expected to always perform as well as a heatmap-type baseline. As a post-competition check, we fit a model with just one KDE lag, corresponding to a heatmap-type baseline, and fixed parameters according to common practice (Chainey and Ratcliffe, 2005) and found that our model was better than this baseline 90% of the time on the true out-of-sample forecasted data (March-May 2017) with an average absolute improvement of 0.16. The improvements were most notable for sparse crimes and short time-horizons, as the baseline model sometimes identified no correct vehicle or burglary hotspots (Figure 6(a)). Interestingly, several of the forecasts for which simple KDE outperformed our model (e.g., Burglary 2m and 3m), hyperparameters for the model were selected using Bayesian Optimization rather than grid search, suggesting that BO will not always give the optimal set of parameters.

To further investigate the importance of various parts of our model, we compared our full method, which combines lagged KDE terms and a Gaussian process surface, to a model without the Gaussian process surface (Figure 6(b)). Here, the full model gave better PEI results 75% of the time, with an average absolute improvement of 0.05. Thus, although our full model is still an improvement, the improvements were not as dramatic as going from a simple heatmap to our full model.

Figure 7 reports two additional comparisons. Figure 6(c) compares the performance of the full model to the best performing grid search model without rotation. The second compares the full model to the best performing grid search model with 600x600 sq ft cells. Rosser et al. (2016) recently demonstrated that due to geocoding, non-cardinal land use, and other related factors, a non-standard alignment could improve predictive accuracy in crime forecasting. In the present application, we explored altering the rotation of the entire tessellation and the dimensions of the cell rectangles. With improved performance of only 0.029 for a freely-rotated model when compared to the best performing non-rotated model, rotation does not appear to be a major contributor to overall performance as can also be seen in Figure 6. However, certain crime categories and forecast windows can be observed to benefit more substantially. A similar result can be observed in Figure 6(d) for altering cell dimensions, which only improves overall performance by 0.019 when compared to a conventionally used rectangle.

Since crossvalidation was used to train the model, the predictive performance achieved is likely to be quite stable. At the same time, the sparseness of several of the forecasted incidents and recent findings on lack of robustness of forecasting models (Rosser and Cheng, 2016) suggests that it is worthwhile to examine the repeat performance of the model. To accomplish this, the 13 week competition period (March through May 2017) was split into 13 one-week forecast periods and a one-week rolling forward prediction was made for each week. The resulting predictions, as seen in Figure 8, manifest variability consistent with the stochastic events being predicted. However, these rolling-forward predictions provide little evidence of over-fitting, even for the sparsest of incidents. They instead suggest, at least for settings like the competition, that the short-term accuracy improvements are robust and repeatable.

Figure 9 (left) shows the actual performance of the full forecasting model for a high volume crime category (ACFS) and a middle-range forecasting period (1 month). Polygons that were correctly forecast as the highest possible crime count polygons are in green. Polygons incorrectly forecast to not be hotspots are in red. And polygons that were incorrectly predicted to be the highest possible crime count polygons are depicted in blue. Crimes are black dots. As mentioned previously, the largest single cluster of hotspots for all-calls-for-service can be seen downtown. However, the model slightly over invested in this section of Portland. As can be seen in the inset, hotspots just across the river had more crimes reported over the relevant forecast window. In practice, most of these misses were relatively small, with “false negatives” only slightly “hotter” than the corresponding “false positive” cells (e.g., 44 crimes in a FN cell versus 39 crimes in a FP cell).

Figure 9 (right) show the actual performance of the model for a sparse crime category (burglary) and a short-range forecasting period (1 week). Forecasted burglary cells are depicted with boxes and actual burglaries are depicted by blue x’s. Boxed x’s indicate a successful prediction while empty boxes indicate a “false positive” prediction. The absence of large-scale clustering is quite visible in both the dispersion of the burglaries throughout Portland and in the similarly dispersed allocation of predictions. As can be seen in the inset, a successful prediction was accompanied by several near misses in the vicinity, including one near-miss off by only a single cell. Predicting sparse crimes, while more difficult than predicting concentrated crimes, is still achievable and with accuracy levels not previously seen with other conventional forecasting methods.

## 5 Discussion

Real-time spatiotemporal forecasting is an area of increasing interest among both practitioners and scholars. And yet, many of the approaches that are commonly in use, such as kernel-smoothing based on fixed bandwidths and cell sizes, can be quite limited in their effectiveness, even for frequent crime types but especially for sparse events. Recent work has begun exploring how sensitive forecasting accuracy is to particular parameter configurations, showing that predictive accuracy is more sensitive to bandwidth selection than cell dimensions (Chainey, 2013). The present investigation similarly suggests that default and other rule-of-thumb parameter choices are not likely to produce the highest accuracy predictions. From a supervised learning perspective, this should come as no surprise, and so in the present study, we exhaustively tuned each of the hyperparameters in our model, showing that careful searching of commonly used parameters can produce performance improvements exceeding those achieved conventionally. Perhaps more importantly, this supervised learning exercise also revealed that while certain parameters are often more important to accuracy than others, no one parameter is universally better and as such, supervised learning across a range of parameters will likely need to be a continuing feature of spatiotemporal event forecasting.

This study also suggests the benefits of a generic framework combining two types of kernel methods—Gaussian processes and kernel smoothing. Among 9 wins in the NIJ forecasting competition, our framework performed equally well on several of the different sub-classes of the problem. Forecasting all-calls-for-service is an excellent example of a heterogeneous high volume spatiotemporal event. Other examples in this category, for which our method might be useful, include ride-sharing demand, which may be driven by a number of different use patterns. We were particularly effective at forecasting burglary, illustrating a sparse process that exhibits minimal clustering. Past work (Johnson et al., 2009) has reported 1-week burglary forecasting accuracy of 10% at 1.3% of coverage area and 25% of burglaries at 5% of coverage area using near-repeat models with baseline KDE models producing 1-week accuracy of 10% at 2% coverage and 25% at 6.5% coverage. By comparison, median 1-week burglary accuracy of 10% was achieved with a coverage area of 0.5% and 50% of the time 25% forecasting accuracy was achieved at 0.5% coverage.111111Mohler et al. (2011) report 5% accuracy for daily predictions at comparable coverage levels. This performance on the sparest of forecast events over the shortest time period in the competition was sufficiently high that it not only won for the scoring metric for which it was optimized but also for the scoring metric for which it was not.

A final question to consider is what is to be done with the information generated by spatiotemporal forecasting models? In recent years, crime forecasting tools have been a supplement or replacement for traditional crime analysis, which has elevated concerns about the possible bias that could come from relying on non-randomly filtered information being the basis for deployment decisions that could lead to additional enforcement actions targeted at particular groups of individuals (Lum and Isaac, 2016).121212Note that forecasting crime in general and forecasting crimes reported to the police in particular are often conflated, yet some types of crimes are systematically underreported, and simultaneously police focus their efforts on some classes of crimes and not others, e.g. street crime vs. white collar crime. For a compelling artistic representation of this issue, see the New Inquiry’s “White Collar Crime Risk Zones,” https://whitecollar.thenewinquiry.com/ (Lavigne, Clifton and Tseng, 2017). While other possibilities beyond enforcement do exist and are actively being explored (Taniguchi and Groff, 2017), these concerns regarding the fairness implications of spatiotemporal forecasting methods are unlikely to be resolved separately from larger questions of fairness in criminal justice decision-making despite recent attempts to formulate definitions of algorithmic fairness (Berk et al., 2017; Corbett-Davies et al., 2017).

Pending the resolution of these algorithmic fairness discussions, we note that bias is only one of the limitations of existing forecasting tools, the other being accuracy, which was the focus of the NIJ competition. The results reported here and elsewhere suggest that forecasting the hottest hotspots for high volume crimes can be done today with great accuracy using a range of different approaches. The same cannot be said for sparse events, at least not yet. While the accuracy of our sparse/short-term forecasts was an order of magnitude above competitors on our competition metric (PEI), it is still only in the low double-digits. This leaves as an open question, partially addressed by this paper and the NIJ competition, whether rare crime events are intrinsically harder to forecast due to random error or are simply harder because of insufficient training data. The fact that some rare crime forecasts saw no improvement in forecasting accuracy despite the addition of more data could be considered suggestive evidence that there may be a signal limit for this type of event. However, if this limit exists, it is important not to conclude that it generalizes to other events, since other events or the same event in other settings may not manifest this pattern. Refitting our models in other settings, both criminological and otherwise, would shed further light on this question.

## References

• Adams, Murray and MacKay (2009) {binproceedings}[author] \bauthor\bsnmAdams, \bfnmRyan Prescott\binitsR. P., \bauthor\bsnmMurray, \bfnmIain\binitsI. \AND\bauthor\bsnmMacKay, \bfnmDavid JC\binitsD. J. (\byear2009). \btitleTractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In \bbooktitleProceedings of the 26th Annual International Conference on Machine Learning \bpages9–16. \bpublisherACM. \endbibitem
• Adepeju, Rosser and Cheng (2016) {barticle}[author] \bauthor\bsnmAdepeju, \bfnmMonsuru\binitsM., \bauthor\bsnmRosser, \bfnmGabriel\binitsG. \AND\bauthor\bsnmCheng, \bfnmTao\binitsT. (\byear2016). \btitleNovel evaluation metrics for sparse spatio-temporal point process hotspot predictions - a crime case study. \bjournalInternational Journal of Geographical Information Science \bvolume30 \bpages2133–2154. \bdoi10.1080/13658816.2016.1159684 \endbibitem
• Berk et al. (2017) {barticle}[author] \bauthor\bsnmBerk, \bfnmRichard\binitsR., \bauthor\bsnmHeidari, \bfnmHoda\binitsH., \bauthor\bsnmJabbari, \bfnmShahin\binitsS., \bauthor\bsnmKearns, \bfnmMichael\binitsM. \AND\bauthor\bsnmRoth, \bfnmAaron\binitsA. (\byear2017). \btitleFairness in Criminal Justice Risk Assessments: The State of the Art. \endbibitem
• Boni and Gerber (2016) {binproceedings}[author] \bauthor\bsnmBoni, \bfnmMohammad Al\binitsM. A. \AND\bauthor\bsnmGerber, \bfnmMatthew S.\binitsM. S. (\byear2016). \btitleAutomatic Optimization of Localized Kernel Density Estimation for Hotspot Policing. In \bbooktitleInternational Conference on Machine Learning and Applications \bpages32–38. \bpublisherIEEE. \bdoi10.1109/ICMLA.2016.0015 \endbibitem
• Bratton (1998) {bbook}[author] \bauthor\bsnmBratton, \bfnmWilliam\binitsW. (\byear1998). \btitleTurnaround: how America’s top cop reversed the crime epidemic, \bedition1st ed. ed. \bpublisherRandom House, \baddressNew York. \endbibitem
• Brix and Diggle (2001) {barticle}[author] \bauthor\bsnmBrix, \bfnmAnders\binitsA. \AND\bauthor\bsnmDiggle, \bfnmPeter J\binitsP. J. (\byear2001). \btitleSpatiotemporal prediction for log-Gaussian Cox processes. \bjournalJournal of the Royal Statistical Society: Series B (Statistical Methodology) \bvolume63 \bpages823–841. \endbibitem
• Caplan, Kennedy and Miller (2011) {barticle}[author] \bauthor\bsnmCaplan, \bfnmJoel M.\binitsJ. M., \bauthor\bsnmKennedy, \bfnmLeslie W.\binitsL. W. \AND\bauthor\bsnmMiller, \bfnmJoel\binitsJ. (\byear2011). \btitleRisk Terrain Modeling: Brokering Criminological Theory and GIS Methods for Crime Forecasting. \bjournalJustice Quarterly \bvolume28 \bpages360–381. \bdoi10.1080/07418825.2010.486037 \endbibitem
• Chainey (2013) {barticle}[author] \bauthor\bsnmChainey, \bfnmS. P.\binitsS. P. (\byear2013). \btitleExamining the influence of cell size and bandwidth size on kernel density estimation crime hotspot maps for predicting spatial patterns of crime. \bjournalBulletin of the Geographical Society of Liege \bvolume60 \bpages7–19. \endbibitem
• Chainey and Ratcliffe (2005) {bbook}[author] \bauthor\bsnmChainey, \bfnmSpencer\binitsS. \AND\bauthor\bsnmRatcliffe, \bfnmJerry\binitsJ. (\byear2005). \btitleGIS and crime mapping. \bpublisherJohn Wiley & Sons. \endbibitem
• Chainey, Tompson and Uhlig (2008a) {barticle}[author] \bauthor\bsnmChainey, \bfnmSpencer\binitsS., \bauthor\bsnmTompson, \bfnmLisa\binitsL. \AND\bauthor\bsnmUhlig, \bfnmSebastian\binitsS. (\byear2008a). \btitleThe Utility of Hotspot Mapping for Predicting Spatial Patterns of Crime. \bjournalSecurity Journal \bvolume21 \bpages4–28. \bdoi10.1057/palgrave.sj.8350066 \endbibitem
• Chainey, Tompson and Uhlig (2008b) {barticle}[author] \bauthor\bsnmChainey, \bfnmSpencer\binitsS., \bauthor\bsnmTompson, \bfnmLisa\binitsL. \AND\bauthor\bsnmUhlig, \bfnmSebastian\binitsS. (\byear2008b). \btitleResponse to Levine. \bjournalSecurity Journal \bvolume21 \bpages303–306. \bdoi10.1057/sj.2008.7 \endbibitem
• Choi and Schervish (2007) {barticle}[author] \bauthor\bsnmChoi, \bfnmTaeryon\binitsT. \AND\bauthor\bsnmSchervish, \bfnmMark J\binitsM. J. (\byear2007). \btitleOn posterior consistency in nonparametric regression problems. \bjournalJournal of Multivariate Analysis \bvolume98 \bpages1969–1987. \endbibitem
• Corbett-Davies et al. (2017) {barticle}[author] \bauthor\bsnmCorbett-Davies, \bfnmSam\binitsS., \bauthor\bsnmPierson, \bfnmEmma\binitsE., \bauthor\bsnmFeller, \bfnmAvi\binitsA., \bauthor\bsnmGoel, \bfnmSharad\binitsS. \AND\bauthor\bsnmHuq, \bfnmAziz\binitsA. (\byear2017). \btitleAlgorithmic decision making and the cost of fairness. \bjournalarXiv preprint arXiv:1701.08230. \endbibitem
• Cressie and Wikle (2011) {bbook}[author] \bauthor\bsnmCressie, \bfnmN.\binitsN. \AND\bauthor\bsnmWikle, \bfnmC. K.\binitsC. K. (\byear2011). \btitleStatistics for spatio-temporal data \bvolume465. \bpublisherWiley. \endbibitem
• Cunningham, Shenoy and Sahani (2008) {binproceedings}[author] \bauthor\bsnmCunningham, \bfnmJohn P\binitsJ. P., \bauthor\bsnmShenoy, \bfnmKrishna V\binitsK. V. \AND\bauthor\bsnmSahani, \bfnmManeesh\binitsM. (\byear2008). \btitleFast Gaussian process methods for point process intensity estimation. In \bbooktitleProceedings of the 25th international conference on Machine learning \bpages192–199. \bpublisherACM. \endbibitem
• Diggle et al. (2013) {barticle}[author] \bauthor\bsnmDiggle, \bfnmPeter J\binitsP. J., \bauthor\bsnmMoraga, \bfnmPaula\binitsP., \bauthor\bsnmRowlingson, \bfnmBarry\binitsB., \bauthor\bsnmTaylor, \bfnmBenjamin M\binitsB. M. \betalet al. (\byear2013). \btitleSpatial and spatio-temporal Log-Gaussian Cox processes: extending the geostatistical paradigm. \bjournalStatistical Science \bvolume28 \bpages542–563. \endbibitem
• Fernandez and Teh (2016) {bunpublished}[author] \bauthor\bsnmFernandez, \bfnmT.\binitsT. \AND\bauthor\bsnmTeh, \bfnmY. W.\binitsY. W. (\byear2016). \btitlePosterior Consistency for a Non-parametric Survival Model under a Gaussian Process Prior. \bnoteArXiv e-prints: 1611.02335. \endbibitem
• Flaxman (2014) {barticle}[author] \bauthor\bsnmFlaxman, \bfnmSeth R\binitsS. R. (\byear2014). \btitleA General Approach to Prediction and Forecasting Crime Rates with Gaussian Processes. \bjournalHeinz College Technical Report. \endbibitem
• Flaxman et al. (2015) {binproceedings}[author] \bauthor\bsnmFlaxman, \bfnmSeth\binitsS., \bauthor\bsnmWilson, \bfnmAndrew\binitsA., \bauthor\bsnmNeill, \bfnmDaniel\binitsD., \bauthor\bsnmNickisch, \bfnmHannes\binitsH. \AND\bauthor\bsnmSmola, \bfnmAlex\binitsA. (\byear2015). \btitleFast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In \bbooktitleInternational Conference on Machine Learning \bpages607–616. \endbibitem
• Gorr and Lee (2015) {barticle}[author] \bauthor\bsnmGorr, \bfnmWilpen L.\binitsW. L. \AND\bauthor\bsnmLee, \bfnmYongJei\binitsY. (\byear2015). \btitleEarly Warning System for Temporary Crime Hot Spots. \bjournalJournal of Quantitative Criminology \bvolume31 \bpages25–47. \bdoi10.1007/s10940-014-9223-8 \endbibitem
• Gorr, Olligschlaeger and Thompson (2003) {barticle}[author] \bauthor\bsnmGorr, \bfnmWilpen\binitsW., \bauthor\bsnmOlligschlaeger, \bfnmAndreas\binitsA. \AND\bauthor\bsnmThompson, \bfnmYvonne\binitsY. (\byear2003). \btitleShort-term forecasting of crime. \bjournalInternational Journal of Forecasting \bvolume19 \bpages579–594. \bdoi10.1016/S0169-2070(03)00092-X \endbibitem
• Hart and Zandbergen (2014) {barticle}[author] \bauthor\bsnmHart, \bfnmTimothy\binitsT. \AND\bauthor\bsnmZandbergen, \bfnmPaul\binitsP. (\byear2014). \btitleKernel density estimation and hotspot mapping: Examining the influence of interpolation method, grid cell size, and bandwidth on crime forecasting. \bjournalPolicing: An International Journal of Police Strategies & Management \bvolume37 \bpages305–323. \bdoi10.1108/PIJPSM-04-2013-0039 \endbibitem
• Heaton et al. (2017) {barticle}[author] \bauthor\bsnmHeaton, \bfnmM. J.\binitsM. J., \bauthor\bsnmDatta, \bfnmA.\binitsA., \bauthor\bsnmFinley, \bfnmA.\binitsA., \bauthor\bsnmFurrer, \bfnmR.\binitsR., \bauthor\bsnmGuhaniyogi, \bfnmR.\binitsR., \bauthor\bsnmGerber, \bfnmF.\binitsF., \bauthor\bsnmGramacy, \bfnmR. B.\binitsR. B., \bauthor\bsnmHammerling, \bfnmD.\binitsD., \bauthor\bsnmKatzfuss, \bfnmM.\binitsM., \bauthor\bsnmLindgren, \bfnmF.\binitsF., \bauthor\bsnmNychka, \bfnmD. W.\binitsD. W., \bauthor\bsnmSun, \bfnmF.\binitsF. \AND\bauthor\bsnmZammit-Mangion, \bfnmA.\binitsA. (\byear2017). \btitleA Case Study Competition Among Methods for Analyzing Large Spatial Data. \bjournalArXiv e-prints. \endbibitem
• Hennig, Osborne and Girolami (2015) {barticle}[author] \bauthor\bsnmHennig, \bfnmPhilipp\binitsP., \bauthor\bsnmOsborne, \bfnmMichael A.\binitsM. A. \AND\bauthor\bsnmGirolami, \bfnmMark\binitsM. (\byear2015). \btitleProbabilistic numerics and uncertainty in computations. \bjournalProceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences \bvolume471. \endbibitem
• Hunt (2016) {bphdthesis}[author] \bauthor\bsnmHunt, \bfnmJoel M\binitsJ. M. (\byear2016). \btitleDo crime hot spots move? Exploring the effects of the modifiable areal unit problem and modifiable temporal unit problem on crime hot spot stability, \btypePhD thesis, \bpublisherAmerican University. \endbibitem
• Johnson et al. (2009) {bincollection}[author] \bauthor\bsnmJohnson, \bfnmShane D\binitsS. D., \bauthor\bsnmBowers, \bfnmKate J\binitsK. J., \bauthor\bsnmBirks, \bfnmD. J.\binitsD. J. \AND\bauthor\bsnmPease, \bfnmKen\binitsK. (\byear2009). \btitlePredictive mapping of crime by ProMap: accuracy, units of analysis, and the environmental backcloth. In \bbooktitlePutting Crime in its Place (\beditor\bfnmDavid\binitsD. \bsnmWeisburd, \beditor\bfnmWim\binitsW. \bsnmBernasco \AND\beditor\bfnmGerben\binitsG. \bsnmBruinsma, eds.) \bpages165–192. \bpublisherSpringer, \baddressDordrecht. \endbibitem
• Kang and Kang (2017) {barticle}[author] \bauthor\bsnmKang, \bfnmHyeon-Woo\binitsH.-W. \AND\bauthor\bsnmKang, \bfnmHang-Bong\binitsH.-B. (\byear2017). \btitlePrediction of crime occurrence from multi-modal data using deep learning. \bjournalPLOS ONE \bvolume12 \bpagese0176244. \bdoi10.1371/journal.pone.0176244 \endbibitem
• Lavigne, Clifton and Tseng (2017) {barticle}[author] \bauthor\bsnmLavigne, \bfnmS.\binitsS., \bauthor\bsnmClifton, \bfnmB.\binitsB. \AND\bauthor\bsnmTseng, \bfnmF.\binitsF. (\byear2017). \btitlePredicting Financial Crime: Augmenting the Predictive Policing Arsenal. \bjournalArXiv e-prints. \endbibitem
• Levine (2004) {btechreport}[author] \bauthor\bsnmLevine, \bfnmNed\binitsN. (\byear2004). \btitleCrimeStat: a spatial statistics program for the analysis of crime incident locations, version 3.0. \btypeTechnical Report, \bpublisherNed Levine and Associates/National Institute of Justice, \baddressWashington, DC. \endbibitem
• Liu and Brown (2003) {barticle}[author] \bauthor\bsnmLiu, \bfnmHua\binitsH. \AND\bauthor\bsnmBrown, \bfnmDonald E.\binitsD. E. (\byear2003). \btitleCriminal incident prediction using a point-pattern-based density model. \bjournalInternational Journal of Forecasting \bvolume19 \bpages603–622. \bdoi10.1016/S0169-2070(03)00094-3 \endbibitem
• Lloyd et al. (2015) {binproceedings}[author] \bauthor\bsnmLloyd, \bfnmChris\binitsC., \bauthor\bsnmGunter, \bfnmTom\binitsT., \bauthor\bsnmOsborne, \bfnmMichael\binitsM. \AND\bauthor\bsnmRoberts, \bfnmStephen\binitsS. (\byear2015). \btitleVariational inference for Gaussian process modulated Poisson processes. In \bbooktitleInternational Conference on Machine Learning \bpages1814–1822. \endbibitem
• Loeffler and Flaxman (2017) {barticle}[author] \bauthor\bsnmLoeffler, \bfnmCharles\binitsC. \AND\bauthor\bsnmFlaxman, \bfnmSeth\binitsS. (\byear2017). \btitleIs gun violence contagious? \bjournalJournal of Quantitative Criminiology. \endbibitem
• Lum and Isaac (2016) {barticle}[author] \bauthor\bsnmLum, \bfnmKristian\binitsK. \AND\bauthor\bsnmIsaac, \bfnmWilliam\binitsW. (\byear2016). \btitleTo predict and serve? \bjournalSignificance \bvolume13 \bpages14–19. \endbibitem
• Mohler (2014) {barticle}[author] \bauthor\bsnmMohler, \bfnmGeorge\binitsG. (\byear2014). \btitleMarked point process hotspot maps for homicide and gun crime prediction in Chicago. \bjournalInternational Journal of Forecasting \bvolume30 \bpages491–497. \endbibitem
• Mohler et al. (2013) {barticle}[author] \bauthor\bsnmMohler, \bfnmGeorge\binitsG. \betalet al. (\byear2013). \btitleModeling and estimation of multi-source clustering in crime and security data. \bjournalThe Annals of Applied Statistics \bvolume7 \bpages1525–1539. \endbibitem
• Mohler and Porter (2017) {barticle}[author] \bauthor\bsnmMohler, \bfnmGeorge\binitsG. \AND\bauthor\bsnmPorter, \bfnmMichael D.\binitsM. D. (\byear2017). \btitleRotational grid, PAI-maximizing crime forecasts. \endbibitem
• Mohler et al. (2011) {barticle}[author] \bauthor\bsnmMohler, \bfnmG. O.\binitsG. O., \bauthor\bsnmShort, \bfnmM. B.\binitsM. B., \bauthor\bsnmBrantingham, \bfnmP. J.\binitsP. J., \bauthor\bsnmSchoenberg, \bfnmF. P.\binitsF. P. \AND\bauthor\bsnmTita, \bfnmG. E.\binitsG. E. (\byear2011). \btitleSelf-Exciting Point Process Modeling of Crime. \bjournalJournal of the American Statistical Association \bvolume106 \bpages100–108. \bdoi10.1198/jasa.2011.ap09546 \endbibitem
• Møller and Rasmussen (2005) {barticle}[author] \bauthor\bsnmMøller, \bfnmJesper\binitsJ. \AND\bauthor\bsnmRasmussen, \bfnmJakob G\binitsJ. G. (\byear2005). \btitlePerfect simulation of Hawkes processes. \bjournalAdvances in applied probability \bvolume37 \bpages629–646. \endbibitem
• Møller, Syversveen and Waagepetersen (1998) {barticle}[author] \bauthor\bsnmMøller, \bfnmJesper\binitsJ., \bauthor\bsnmSyversveen, \bfnmAnne Randi\binitsA. R. \AND\bauthor\bsnmWaagepetersen, \bfnmRasmus Plenge\binitsR. P. (\byear1998). \btitleLog gaussian cox processes. \bjournalScandinavian journal of statistics \bvolume25 \bpages451–482. \endbibitem
• NIJ (2017) {bmisc}[author] \bauthor\bsnmNIJ (\byear2017). \btitleReal-Time Crime Forecasting Challenge. \endbibitem
• Ogata (1988) {barticle}[author] \bauthor\bsnmOgata, \bfnmYosihiko\binitsY. (\byear1988). \btitleStatistical models for earthquake occurrences and residual analysis for point processes. \bjournalJournal of the American Statistical association \bvolume83 \bpages9–27. \endbibitem
• O’Hagan (1992) {barticle}[author] \bauthor\bsnmO’Hagan, \bfnmAnthony\binitsA. (\byear1992). \btitleSome Bayesian numerical analysis. \bjournalBayesian Statistics \bvolume4 \bpages4–2. \endbibitem
• Pease et al. (1998) {bbook}[author] \bauthor\bsnmPease, \bfnmKenneth\binitsK. \betalet al. (\byear1998). \btitleRepeat victimisation: Taking stock \bvolume90. \bpublisherHome Office Police Research Group London. \endbibitem
• Perry et al. (2013) {btechreport}[author] \bauthor\bsnmPerry, \bfnmWalter L.\binitsW. L., \bauthor\bsnmMcInnis, \bfnmBrian\binitsB., \bauthor\bsnmPrice, \bfnmCarter C.\binitsC. C., \bauthor\bsnmSmith, \bfnmSusan C.\binitsS. C. \AND\bauthor\bsnmHollywood, \bfnmJohn S.\binitsJ. S. (\byear2013). \btitlePredictive Policing: The Role of Crime Forecasting in Law Enforcement Operations \btypeTechnical Report, \bpublisherRAND Corporation, \baddressSanta Monica. \endbibitem
• Porter and Reich (2012) {barticle}[author] \bauthor\bsnmPorter, \bfnmMichael D.\binitsM. D. \AND\bauthor\bsnmReich, \bfnmBrian J.\binitsB. J. (\byear2012). \btitleEvaluating temporally weighted kernel density methods for predicting the next event location in a series. \bjournalAnnals of GIS \bvolume18 \bpages225–240. \bdoi10.1080/19475683.2012.691904 \endbibitem
• Rahimi and Recht (2007) {binproceedings}[author] \bauthor\bsnmRahimi, \bfnmAli\binitsA. \AND\bauthor\bsnmRecht, \bfnmBenjamin\binitsB. (\byear2007). \btitleRandom features for large-scale kernel machines. In \bbooktitleAdvances in neural information processing systems \bpages1177–1184. \endbibitem
• Rahimi and Recht (2008) {binproceedings}[author] \bauthor\bsnmRahimi, \bfnmAli\binitsA. \AND\bauthor\bsnmRecht, \bfnmBenjamin\binitsB. (\byear2008). \btitleWeighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In \bbooktitleAdvances in neural information processing systems \bpages1313–1320. \endbibitem
• Rasmussen and Williams (2006) {bmisc}[author] \bauthor\bsnmRasmussen, \bfnmCarl Edward\binitsC. E. \AND\bauthor\bsnmWilliams, \bfnmChristopher KI\binitsC. K. (\byear2006). \btitleGaussian processes for machine learning. \endbibitem
• Rodrigues and Diggle (2012) {barticle}[author] \bauthor\bsnmRodrigues, \bfnmAlexandre\binitsA. \AND\bauthor\bsnmDiggle, \bfnmPeter J.\binitsP. J. (\byear2012). \btitleBayesian Estimation and Prediction for Inhomogeneous Spatiotemporal Log-Gaussian Cox Processes Using Low-Rank Models, With Application to Criminal Surveillance. \bjournalJournal of the American Statistical Association \bvolume107 \bpages93–101. \endbibitem
• Rosser and Cheng (2016) {barticle}[author] \bauthor\bsnmRosser, \bfnmGabriel\binitsG. \AND\bauthor\bsnmCheng, \bfnmTao\binitsT. (\byear2016). \btitleImproving the Robustness and Accuracy of Crime Prediction with the Self-Exciting Point Process Through Isotropic Triggering. \bjournalApplied Spatial Analysis and Policy \bpages1–21. \endbibitem
• Rosser et al. (2016) {barticle}[author] \bauthor\bsnmRosser, \bfnmGabriel\binitsG., \bauthor\bsnmDavies, \bfnmToby\binitsT., \bauthor\bsnmBowers, \bfnmKate J.\binitsK. J., \bauthor\bsnmJohnson, \bfnmShane D.\binitsS. D. \AND\bauthor\bsnmCheng, \bfnmTao\binitsT. (\byear2016). \btitlePredictive Crime Mapping: Arbitrary Grids or Street Networks? \bjournalJournal of Quantitative Criminology \bpages1–26. \bdoi10.1007/s10940-016-9321-x \endbibitem
• Schölkopf and Smola (2002) {bbook}[author] \bauthor\bsnmSchölkopf, \bfnmBernhard\binitsB. \AND\bauthor\bsnmSmola, \bfnmAlexander J\binitsA. J. (\byear2002). \btitleLearning with kernels: support vector machines, regularization, optimization and beyond. \bpublisherthe MIT Press. \endbibitem
• Schutt (1922) {barticle}[author] \bauthor\bsnmSchutt, \bfnmHarold G.\binitsH. G. (\byear1922). \btitleAdvanced police methods in Berkeley. \bjournalNational Municipal Review \bvolume11 \bpages80–85. \bdoi10.1002/ncr.4110110308 \endbibitem
• Shirota et al. (2017) {barticle}[author] \bauthor\bsnmShirota, \bfnmShinichiro\binitsS., \bauthor\bsnmGelfand, \bfnmAlan E\binitsA. E. \betalet al. (\byear2017). \btitleSpace and circular time log Gaussian Cox processes with application to crime event data. \bjournalThe Annals of Applied Statistics \bvolume11 \bpages481–503. \endbibitem
• Snoek, Larochelle and Adams (2012) {binproceedings}[author] \bauthor\bsnmSnoek, \bfnmJasper\binitsJ., \bauthor\bsnmLarochelle, \bfnmHugo\binitsH. \AND\bauthor\bsnmAdams, \bfnmRyan P\binitsR. P. (\byear2012). \btitlePractical bayesian optimization of machine learning algorithms. In \bbooktitleAdvances in neural information processing systems \bpages2951–2959. \endbibitem
• Stein (1999) {bbook}[author] \bauthor\bsnmStein, \bfnmMichael L\binitsM. L. (\byear1999). \btitleInterpolation of Spatial Data: Some Theory for Kriging. \bpublisherSpringer Science & Business Media. \endbibitem
• Taddy (2010) {barticle}[author] \bauthor\bsnmTaddy, \bfnmMatthew A.\binitsM. A. (\byear2010). \btitleAutoregressive Mixture Models for Dynamic Spatial Poisson Processes: Application to Tracking Intensity of Violent Crime. \bjournalJournal of the American Statistical Association \bvolume105 \bpages1403–1417. \bdoi10.1198/jasa.2010.ap09655 \endbibitem
• Taniguchi and Groff (2017) {bmisc}[author] \bauthor\bsnmTaniguchi, \bfnmTravis\binitsT. \AND\bauthor\bsnmGroff, \bfnmElizabeth R.\binitsE. R. (\byear2017). \btitleNear Repeat Burglary Intervention. \endbibitem
• Teh and Rao (2011) {binproceedings}[author] \bauthor\bsnmTeh, \bfnmYee W\binitsY. W. \AND\bauthor\bsnmRao, \bfnmVinayak\binitsV. (\byear2011). \btitleGaussian process modulated renewal processes. In \bbooktitleAdvances in Neural Information Processing Systems \bpages2474–2482. \endbibitem
• Wang, Gerber and Brown (2012) {bincollection}[author] \bauthor\bsnmWang, \bfnmXiaofeng\binitsX., \bauthor\bsnmGerber, \bfnmMatthew S.\binitsM. S. \AND\bauthor\bsnmBrown, \bfnmDonald E.\binitsD. E. (\byear2012). \btitleAutomatic Crime Prediction Using Events Extracted from Twitter Posts. In \bbooktitleSocial Computing Behavioral - Cultural Modeling and Prediction \bpages231–238. \bpublisherSpringer Berlin Heidelberg. \endbibitem
• Zou and Hastie (2005) {barticle}[author] \bauthor\bsnmZou, \bfnmHui\binitsH. \AND\bauthor\bsnmHastie, \bfnmTrevor\binitsT. (\byear2005). \btitleRegularization and variable selection via the elastic net. \bjournalJournal of the Royal Statistical Society: Series B (Statistical Methodology) \bvolume67 \bpages301–320. \endbibitem

## Appendix A Gaussian processes for spatiotemporal learning

### a.1 Spatial statistics

Following Cressie and Wikle (2011), we use Gaussian processes as the fundamental modeling approach for spatiotemporal data. In the particular case of point pattern data such as crime events, we follow Diggle et al. (2013) in considering the log-Gaussian Cox Process.

A Gaussian process is a stochastic model which can be used as a nonparametric prior over functions . See Rasmussen and Williams (2006) for a comprehensive introduction. is defined on some index set and for our purposes we will assume that is real-valued, so . is parameterized by a mean function and a covariance kernel function :

 f∼GP(μ,k(⋅,⋅)) (A.1)

meaning that:

 E[f(x)] =μ(x) (A.2) Cov(f(x),f(x′)) =k(x,x′) (A.3)

The defining feature of a Gaussian process is that at any finite set of indices , the distribution of the vector is a multivariate Gaussian:

 [f(x1),…,f(xn)]⊤∼MVN([μ(x1),…,μ(xn)]⊤,K) (A.4)

where the covariance matrix .

Gaussian processes are often described as a nonparametric model, since the number of parameters grows with the sample size, with attractive posterior consistency results in a simple regression setting (Choi and Schervish, 2007), and and as part of a Cox survival model (Fernandez and Teh, 2016).

### a.2 Scalable kernel methods

The matrix algebra operations required for calculations involving multivariate Gaussians are not scalable to large datasets. Practitioners face the same issue when applying nonlinear kernel methods (Schölkopf and Smola, 2002), as they rely on the calculation and manipulation of an Gram matrix, which corresponds exactly to the covariance matrix parameterized by the covariance kernel in Gaussian processes. While a variety of approaches have been proposed to alleviate this computational difficulty (for a comprehensive comparison in the spatial setting, see Heaton et al. (2017)), we consider random Fourier features (Rahimi and Recht, 2007, 2008) due to the simplicity with which it can be embedded within a larger supervised learning framework.

Random Fourier Features are a randomized approximation yielding a finite-dimensional feature mapping (and corresponding finite-dimensional Reproducing Kernel Hilbert Space) which approximates the original kernel. Proposed in 2007 (Rahimi and Recht, 2007) and marketed as only requiring 3 lines of MATLAB code to apply, the authors won a “Test of Time Award” at NIPS in 2017 for their widespread applicability and elegance.

Recall Bochner’s theorem (for a precise statement in multiple dimensions see (Stein, 1999, p. 24)), which establishes a one-to-one correspondence between a stationary kernel and a positive finite measure . In particular, is the Fourier transform of :

 κ(x−y)=∫Reiω⊤(x−y)dΛ(ω) (A.5)

Recognizing that this infinite dimensional integral is an expectation over the measure suggests that it can be approximated using Monte Carlo sampling. After normalizing appropriately, we treat as a pdf and consider iid samples:

 ω1,…,ωd∼Λ(ω) (A.6)

Then we have the following approximation:

 κ(x−y) ≈1dd∑j=1eiω⊤j(x−y) (A.7) =1dd∑j=1cos(ω⊤j(x−y))+isin(ω⊤j(x−y)) (A.8) =1dd∑j=1cos(ω⊤jx)cos(ω⊤jy)+sin(ω⊤jx)sin(ω⊤jy) (A.9) =⟨Φ(x),Φ(y)⟩ (A.10)

Since we know that the kernel is real-valued, in line (A.9) we ignore the imaginary component and expand using the trigometric identity for . We can now define an explicit feature mapping consisting of the following pairs of elements concatenated together:

 Φj(x)=1√d(cos(ω⊤jx)sin(ω⊤jx)) (A.11)

The elegance of random Fourier features is that in a supervised learning setting with observations and covariates with design matrix , we can immediately consider introducing kernel-based nonlinearities, without changing our learning approach, simply by transforming our design matrix into for a set of random frequencies (here, is not the matrix Taylor expansion of the matrix , but the element-wise computation of cosine on ). In homage to the three lines of MATLAB code, three lines of R code are shown below to transform a design matrix X into a new design matrix Phi assuming a squared exponential kernel with lengthscale 1, . In our space-time setting, X has 3 columns giving the x, y, and t coordinates of the observations, but it could also include covariates if available.

Omega = matrix(rnorm(d*ncol(X)), d)
Proj = X %*% t(Omega)
Phi = cbind(cos(Proj), sin(Proj)) / sqrt(d)


Changing the covariance kernel’s lengthscale corresponds to changing the variance of the normal distribution. Using a Matérn kernel instead of a squared exponential requires sampling from a Student-t distribution instead of a normal distribution. At this point, any (suitably regularized) linear learning method can be applied to Phi: ridge regression with Phi is an approximation to kernel ridge regression with X; Bayesian linear regression with Phi is an approximation to Gaussian process regression with X.

## Appendix B Hyperparameter choice

Portland, like many cities, has a mix of north/south and east/west aligned streets. However, it also has a non-trivial number of obliquely-oriented streets and parcels. This is especially the case in downtown Portland, east of the Williamette River and south of West Burnside Street. Given the concentration of calls-for-service in this area, it seemed likely that a non-standard alignment could be beneficial, especially for all calls-for-service. For this reason, grid angle of rotation was included as another parameter to be learned by the model, an idea first proposed by (Johnson et al., 2009).

Non-uniform land use coupled in Portland with the common practice of geocoding crime data to the street grid suggested that a fixed N/S oriented square tessellation could be sub-optimal (Rosser et al., 2016).131313Data provided by NIJ included calls geocoded to the building footprints and then offset several feet onto the street grid directly in front of the relevant address. However, contest rules required that shapes be polygons that could be tesselated without rotation. While squares were the simplest choice, the nature of the calls-for-service data geocoded to the street grid suggested that rectangles could be preferable141414We also considered the other regular tesselations of the plane – namely, equilateral triangles and regular hexagons. Both suffered from computational problems, as no extant open libraries offer scalably-fast kernel density estimation over non-rectangular polygons. Nevertheless, some firms (notably Uber) use hexagons at scale for spatiotemporal forecasting.. For this reason, we chose to leave the cell shape as a parameter to be optimized with each crime type and forecasting period potentially receiving its own optimized solution. Similarly, contest rules permitted a variety of different cell sizes and consistent with recent working demonstrating the sensitivity of forecast accuracy to cell size(Hart and Zandbergen, 2014), we left this as a parameter to be optimized. Subject to processing resource limitations, theoretically, any parameter could be optimized for forecasting accuracy.

The final hyperparameters we selected are shown in Table A1. Winning entries are highlighted in yellow.

You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters