Beyond Points and Paths: Counting Private Bodies
Abstract
Mining of spatial data is an enabling technology for mobile services, Internet-connected cars, and the Internet of Things. But the very distinctiveness of spatial data that drives utility comes at the cost of user privacy. In this work, we continue the tradition of privacy-preserving spatial analytics, focusing not on point or path data, but on planar spatial regions. Such data represents the area of a user's most frequent visitation, such as "around home and nearby shops". Specifically, we consider the differentially-private release of data structures that support range queries for counting users' spatial regions. Counting planar regions leads to unique challenges not faced in existing work. A user's spatial region that straddles multiple data structure cells can lead to duplicate counting at query time. We provably avoid this pitfall by leveraging the Euler characteristic. To address the increased sensitivity of range queries to spatial region data, we calibrate privacy-preserving noise using bounded user region size and a constrained inference that uses robust least absolute deviations. Our novel constrained inference reduces noise and introduces covertness by (privately) imposing consistency. We provide a full end-to-end theoretical analysis of both differential privacy and high-probability utility for our approach using concentration bounds. A comprehensive experimental study on several real-world datasets establishes practical validity.
I Introduction
The ubiquity, quality and usability of location-based services supports the ready availability of user tracking. Location data sharing is used across a wide range of applications such as traffic monitoring, facility location planning, recommendation systems and contextual advertising. The distinctiveness of location data, however, has led to calls for location privacy [1, 2]: the ability to track users in aggregate without breaching individual privacy.
Typical private spatial analytics supports point locations. Points and trajectories, however, do not best represent user location in all applications. In facility-services planning, a planner may wish to locate a new department store in a location that overlaps with users' regions of frequent visitation, while hotel-booking sites collect area-level information about customers' preferred destinations. Such problems motivate our focus on counting private planar bodies (we use body and region interchangeably to refer to a user's spatial area). Given a collection of privacy-sensitive planar bodies representing regions of frequent location, we wish to support counting range queries while preserving individual privacy. Fig. 1 illustrates this task, on a map of metropolitan Melbourne with planar bodies representing regions of individual users' frequent visitation. Third parties may wish to submit any number of queries requesting the number of users' areas falling in a specified query region, e.g., for urban transport planning or retail analytics.
A leading approach for responding to range queries in spatial data analytics is aggregation [3, 4, 5, 6, 7]. Initial interest in aggregation was due to computational efficiency. In the setting of planar bodies, conventional grid-partitioned histograms cannot provide accurate results due to the duplicate counting problem (in the literature, the terms multiple, double or distinct counting are used interchangeably; we suggest the term "duplicate" as it conveys that objects are over-counted), as a planar body may span more than one histogram cell simultaneously. This is a problem unique to counting planar bodies. To address this challenge, we leverage the Euler characteristic [8], whereby face, edge and vertex counts are stored separately. Such Euler histograms [9] permit exact counting of convex planar bodies [10].
The recently emerged strong guarantee of differential privacy [11] has attracted a number of researchers in location privacy. Typical work studies aggregation of point and trajectory data [12, 13, 14, 15, 16], often via histogram-like data structures, regular or hierarchical, for controlling the level of perturbation required for privacy.
Our goal in this paper is to address the accurate counting of planar bodies, while providing the strong guarantee of differential privacy. While Euler histograms provide an excellent starting point in terms of utility, computational efficiency and aggregation-based qualitative privacy, a service provider may be directed by users to provide strong semantic privacy. Differential privacy guarantees that an attacker with significant prior knowledge and computational resources cannot determine the presence or absence of a user in a set of planar bodies.
The challenge in combining the ideas of Euler histograms and differential privacy is that the data structure's large number of counts requires randomised perturbation. As a result, the total noise added could be prohibitively high. Compared to point data, in which at most one cell is impacted per record, here an object could span more than one cell, impacting many counts. Naive solutions would therefore significantly degrade utility. Moreover, when sampled independently, perturbations can destroy the consistency of query responses over the resulting structure [17].
The first stage of our approach is to perturb counts of a Euler histogram by applying noise controlled via sensitivity to a natural bound on planar body size. Then, to reinstate consistency and improve utility at no cost to privacy, we apply constrained inference that seeks to minimally update counts so as to satisfy consistency constraints. These constraints reflect relationships between data structure counts that must exist, but may be violated by perturbation. Under these constraints we apply least absolute deviations (LAD), which is more robust to outliers than ordinary least squares regression—used previously for constrained inference in differential privacy. By enforcing consistency, we also "average out" previously-added noise, thereby improving utility in certain cases. Finally, we round counts so that query responses are integral. This final stage, combined with consistency, yields responses that preserve a covertness property such that third-party observers cannot determine that privacy-preserving perturbation has taken place.
Our focus is on the non-interactive privacy setting, wherein our mechanisms release privacy-preserving data structures to third parties, with no limitation on the number of subsequent query responses permitted.
Contributions. We deliver several main contributions:

For the first time, we address the differentially-private counting of planar bodies in the non-interactive setting;

We propose differentially-private mechanisms that leverage the Euler characteristic (via the Euler histogram data structure) to address the duplicate counting problem;

We formulate novel constrained inference to reduce noise and introduce consistency based on the robust method of least absolute deviations; combined with rounding, this guarantees a covertness property;

We contribute an end-to-end theoretical analysis of both high-probability utility and differential privacy; and

We conduct a comprehensive experimental study on realworld datasets, which confirms the suitability of our approach to private range queries on spatial bodies.
II Related Work
Aggregation under range queries has emerged as a fundamental primitive in spatial analytics [3, 4, 5, 6, 7]. Originally motivated by statistical and computational efficiency, aggregation is now also used for qualitative privacy.
A key challenge in aggregation is the distinct counting [3, 4, 5, 6, 7] or multiple-count problem [10]. In contrast to point objects, a spatial body can span more than one cell in a partitioned space, inhibiting the ability of regular histograms to form accurate counts. Euler histograms [9] are designed to address this problem for convex bodies [10], by appealing to Euler's formula from graph theory [8]. A variation of the Euler histogram has been studied for trajectory data to address aggregate queries on moving objects [18]. In that work, Euler histograms were used in a distributed setting (motivating a distributed Euler histogram), to tackle the duplicate (distinct) entry problem rather than duplicate (distinct) counting. There is a line of work [19] in which the CASE histogram has been proposed as a privacy-preserving approach for trajectory data analytics, where only count data is utilised in a partitioned space, applying the Euler characteristic to address duplicate counting. The authors of [19] discuss the interactive setting for differentially-private Euler histogram release, which has a prohibitive limitation: the number of queries permitted is linear in the number of bodies. Our work has no such limitation (see [20]).
Differential privacy [11, 20] has now become a preferred approach to data sanitisation as it provides a strong semantic guarantee with minimal assumptions placed on the adversary's knowledge or capabilities. Due to its popularity, differential privacy has been applied to many algorithms and across many domains, such as specialised versions of spatial data indexing structures designed with differential privacy for the purpose of private record matching [12]; in spatial crowdsourcing to help volunteer workers' locations remain private [21]; in machine learning, releasing differentially-private learned models of SVM classifiers [22]; and for modelling human mobility from real-world cellular network data [23].
Within the scope of aggregation, studies in the area of point privacy have also proposed sanitisation algorithms for generating differentially private histograms and releasing aggregate statistics. Many studies have looked at differential privacy of point sets [12, 14, 13, 16, 15, 24], examining regular grid-partitioned data structures and hierarchical structures. Our work, for the first time, addresses the problem of differentially-private counting of planar bodies.
III Preliminaries
III-A Euler Histograms
One natural but qualitative approach to privacy preservation is spatial aggregation. We will leverage a data structure that permits spatial aggregation for body counts. Given a grid-partitioned space, an Euler histogram data structure allocates buckets not only for grid cells, but also for grid-cell edges and vertices. We formally define the data structure below.
Definition 1.
Consider an arbitrary partition of a subset of ℝ² into convex cells. Define I_F, I_E, I_V to be index sets over the partition's faces, edges (face intersections), and vertices (edge intersections), and write I = I_F ∪ I_E ∪ I_V. Let a = (a_i)_{i∈I} be a vector with components the faces, edges and vertices indexed by I (i.e., each a_i represents a face/edge/vertex area of the Euclidean plane); and let a vector h = (h_i)_{i∈I} of non-negative integers be indexed by I as well (representing counts per face/edge/vertex). Then we call the data structure (a, h) an Euler histogram.
For example, an Euler histogram could be defined over a Voronoi partition defined by a finite set of sensors [18]; or a rectangular partition over an urban area [19] such as in Fig. 2.
Beigel and Tanin [9] first introduced to spatial databases the observation that the Euler characteristic [8] (including its extensions to higher dimensions) directly applies to this data structure. Euler's characteristic states that the number of convex bodies overlapping certain query regions can be computed exactly as
#bodies = F − E + V ,   (1)
where F, E, V are the sums of face, edge, and vertex counts in h within the given query region (QR in Fig. 2). Duplicate counting due to summing face counts is corrected by subtracting edge counts. This in turn can overcompensate, and is corrected by adding vertex counts. This is a special case of the Inclusion-Exclusion Principle of set theory and applied probability. Fig. 2 illustrates the impact two planar bodies have on a square-partition Euler histogram. Compared to conventional histograms, with the use of extra counts for grid-cell edges and vertices, large objects spanning more than one cell are now distinguishable from several small objects intersecting only one cell. Applying (1) to calculate the number of objects inside the highlighted QR of Fig. 2, we arrive at the correct answer.
III-B Differential Privacy
We consider statistical databases of records, each representing a user's spatial region. Randomisation is vital for preventing an adversary from inverting a released statistic to reconstruct the original (private) data.
Definition 2.
A randomised mechanism M is said to preserve ε-differential privacy for ε > 0, if for all neighbouring databases D, D′, which differ in exactly one record, and measurable response sets R: Pr[M(D) ∈ R] ≤ exp(ε) · Pr[M(D′) ∈ R].
Definition 2 implies that an algorithm is differentially private if the change, addition or deletion of a single record does not significantly affect the output distribution. Differential privacy has become a de facto standard for privacy of input data to statistical databases due to it being a semantic guarantee [11].
IV Problem Statement
The focus of this paper is responding to range queries over spatial datasets consisting of a spatial region per user.
Problem 1.
Given a set of planar bodies, our goal is to batch process them to produce a data structure that can respond to an unlimited number of range queries within some fixed, bounded area: given a query region QR, we are to respond with an approximate count of bodies overlapping that region.
For example, a range query covering the entire area in Fig. 1 might elicit a response of (exact count of) 12.
IV-A Evaluation Metrics
We consider four properties of mechanisms, as competing metrics for evaluating solutions to Problem 1.

Utility: We measure utility by the absolute error of query responses relative to the true count of bodies intersecting a given query region.

Privacy: Mechanisms should achieve non-interactive differential privacy, at some level ε, in their release of a data structure on sensitive spatial data.

Consistency: If responses to all possible queries agree with some fixed set of bodies then we say that the mechanism is consistent. Such a set of bodies need not coincide with the original input bodies.

Covertness: If a consistent counting mechanism's query responses are integer-valued, then we also call it covert.
Utility and privacy are in direct tension, for establishing privacy typically involves reducing the influence of data on responses. However, for fixed levels of privacy, we can ask what levels of utility are possible among available solutions to Problem 1.
If privacypreserving perturbations are made independently across a data structure, it is unsurprising that overlapping queries will not necessarily result in consistent responses. This may be undesirable for some applications that utilise multiple, overlapping queries e.g., urban planning. We consider specific consistency constraints which relate to the data structure adopted. As such, the level of consistency can be benchmarked according to the number of consistency violations suffered. Unlike privacy, consistency is not necessarily at odds with utility: indeed we will demonstrate how imposing consistency can actually improve utility. Intuitively, if privacypreservation involves injecting independent, random perturbations to a data structure, then consistency corresponds to a smoothness assumption that can be used to ‘cancel out’ the deleterious effect of perturbation. Consistency may also be applied when a measure of ‘stealth’ is desired for a counting mechanism.
IV-B Assumptions
The theoretical guarantees developed in this paper leverage four assumptions (cf. Fig. 3). Each is relatively weak, being well motivated and satisfied in most practical settings.

We assume that the space partition’s cells are all convex.

We assume that query regions are convex unions of our space partition’s cells.

We assume that all planar bodies are convex.

We assume that all planar bodies are of some bounded diameter B.
Our first three assumptions are sufficient for guaranteeing correctness (perfect utility) for Euler histograms. Relaxing these assumptions may come at the cost of utility. For example, for convex query regions that are not unions of cells, we can exactly count the number of bodies in the (enlarged) union of cells intersecting the QR. And general query regions will still result in excellent utility. Two important partition geometries satisfy these conditions: rectangular and Voronoi partitions.
The fourth assumption controls the Lipschitz smoothness of Euler histogram counts with respect to input bodies. This parameter—also known as the global sensitivity (cf. Definition 3)—calibrates the scale of noise added for differential privacy. We consider a motivating example to be regions of frequent visitation, which are necessarily bounded. With B sufficiently large, no restriction is made on valid bodies.
Without loss of generality, we assume partitions are square of side length A, divided into m rows and m columns (so that m = A/d), yielding square cells of side length d (cf. Fig. 3).
V Algorithms and Analysis
Our approach consists of four complementary algorithms.
V-A Algorithm: Euler
Algorithm 1 creates a data structure (an Euler histogram, cf. Sec. III-A) to represent aggregated counts of a given set of convex planar bodies. The algorithm simply increments counts for any face, edge or vertex that intersects a body.
Privacy.
Euler is qualitatively private via aggregation, but it does not achieve any level of differential privacy, being deterministic.
Corollary 1.
If input bodies, partition cells, and query region are convex, and the query region is a union of cells, then Euler’s responses to the range query via (1) are accurate.
Computational Complexity. As our partition has m rows and m columns, Euler's time and space complexities are efficient at O(|S|·m²) and O(m²) respectively, for |S| input bodies.
V-B Algorithm: DiffPriv
Euler achieves a number of our target properties but not differential privacy. We now introduce differential privacy to our approach by perturbing Euler histogram counts. In Algorithm 2, we add carefully-crafted random noise based on the sensitivity of the histogram to input bodies. We truncate any resulting negative counts to zero, improving utility at no cost to privacy.
Privacy. The key step to establishing the differential privacy of DiffPriv is to calculate the Lipschitz smoothness of Euler—the sensitivity to which the scale of added noise is calibrated.
Definition 3.
Let f be a deterministic, real-vector-valued function of a database. The global sensitivity (GS) of f is given by Δ(f) = max ‖f(D) − f(D′)‖₁, taken over all neighbouring pairs of databases D, D′.
The global sensitivity is a property of the function f, independent of the input database. For Euler histograms, the GS measures the effect on the histogram count vector due to changing an input planar body, related to a user's spatial region.
Lemma 1.
The global sensitivity of Euler is Δ = (2⌈B/d⌉ + 1)², where d is the cell side length, and B is an upper bound on planar body diameter.
Proof:
By Assumption 4 (cf. Fig. 3), the number of cells that could intersect with a body is at most k = ⌈B/d⌉ + 1 in one direction. Therefore the total number of cells that could intersect a body is k². From this, the number of faces, edges and vertices of the partition intersecting with a body can be upper-bounded as

#Faces ≤ k²
#Edges ≤ 2k(k − 1)
#Vertices ≤ (k − 1)²

Summing these, we may bound the total number of partition components intersected by the body as k² + 2k(k − 1) + (k − 1)² = (2k − 1)² = (2⌈B/d⌉ + 1)².

Since changing a single body in a database can affect impacted histogram cell counts by one, this expression is also a bound on global sensitivity. ∎
DiffPriv applies the Laplace mechanism [11] to Euler: it adds to a non-private vector-valued function f, i.i.d. Laplace-distributed noise with centre zero and scale given by b = Δ(f)/ε, for desired privacy level ε > 0. Here, f = Euler.
Corollary 3.
DiffPriv preserves differential privacy.
Proof:
The result follows by applying the triangle inequality to the odds ratio, using the definition of the Laplace density and global sensitivity [11]. ∎
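The mechanism can be sketched in a few lines; this is our own hedged illustration, with the sensitivity helper mirroring the counting argument of Lemma 1 and function names that are hypothetical rather than from the paper's algorithms.

```python
import numpy as np

def sensitivity(B, d):
    """Global sensitivity of the Euler histogram, per Lemma 1's counting argument."""
    k = int(np.ceil(B / d)) + 1          # cells a diameter-B body can touch per axis
    return k**2 + 2*k*(k - 1) + (k - 1)**2   # faces + edges + vertices = (2k - 1)**2

def diffpriv(counts, B, d, eps, rng=None):
    """Laplace mechanism on histogram counts, truncating negatives to zero."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity(B, d) / eps          # Laplace scale b = Delta / eps
    noisy = counts + rng.laplace(0.0, scale, size=counts.shape)
    return np.maximum(noisy, 0.0)            # truncation: free utility, no privacy cost

print(sensitivity(2.0, 1.0))                 # -> 25, the T-Drive/GeoLife setting
counts = np.array([12.0, 3.0, 0.0, 7.0])     # toy face/edge/vertex counts
release = diffpriv(counts, B=2.0, d=1.0, eps=1.0)
```

Note how coarser cells shrink the sensitivity: with B = 2km, Δ falls from 49 (d = 0.8km) to 25 (d = 1km) to 9 (d = 2km), matching the settings discussed in the experimental study.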
Utility. DiffPriv is neither covert nor consistent; however, we can bound its utility.
Theorem 1.
For confidence level ρ ∈ (0, 1), the counts h output by Euler and counts h̃ output by DiffPriv are uniformly close with high probability: max_i |h̃_i − h_i| ≤ (Δ/ε) log(|I|/ρ) holds with probability at least 1 − ρ, where Δ is the global sensitivity of Lemma 1 and I indexes the histogram counts.
Proof:
For convenience, recall the combined index set I = I_F ∪ I_E ∪ I_V, noting that |I| = O(m²) for an m × m partition. By the definition of DiffPriv, we have that h̃_i = h_i + Y_i, where the Y_i are i.i.d. zero-mean Laplace with scale b = Δ/ε.
By the cumulative distribution function of the zero-mean Laplace, it follows that Pr[|Y_i| > t] = exp(−t/b), for any scalar t ≥ 0. By the union bound it follows that Pr[∃ i ∈ I : |Y_i| > t] ≤ |I| exp(−t/b).
Applying De Morgan's law, Pr[∀ i ∈ I : |Y_i| ≤ t] ≥ 1 − |I| exp(−t/b).
Solving ρ = |I| exp(−t/b) yields t = b log(|I|/ρ), so that Pr[max_i |h̃_i − h_i| ≤ (Δ/ε) log(|I|/ρ)] ≥ 1 − ρ.
The result follows from the Y_i being i.i.d. ∎
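For a feel of the bound's magnitude, one can plug in the T-Drive parameter setting (m = 20 cells per side, Δ = 25, ε = 1); the confidence level ρ = 0.05 and the count of faces, interior edges and interior vertices are our own illustrative assumptions.

```python
import math

# Illustrative only: evaluating the uniform error bound under assumed values.
m, Delta, eps, rho = 20, 25, 1.0, 0.05
n_counts = (2 * m - 1) ** 2                     # faces + interior edges + vertices
bound = (Delta / eps) * math.log(n_counts / rho)  # holds w.p. >= 1 - rho
print(n_counts, round(bound, 1))
```

This is a worst-case per-count bound; in practice the experiments show much smaller relative error, since independent noise terms tend to cancel when aggregated over a query region.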
Computational Complexity. On our m-row, m-column partition, DiffPriv's time and space complexities are an efficient O(m²).
V-C Algorithm: Linear Programming
After adding randomised noise with DiffPriv, we apply constrained inference to smooth this noise, as detailed below. We first begin by defining constrained inference, followed by a set of consistency constraints.
V-C1 Constrained Inference: LAD
Constrained inference models the noisy counts output by DiffPriv as noisy observations of latent counts, which are themselves related according to a set of constraints. Inference effectively smooths the differentially-private release, potentially improving utility without affecting privacy. Previously, ordinary least squares (OLS) has driven constrained inference [14, 25]. Here we propose instead to use least absolute deviations (LAD), also referred to as least absolute residuals, least absolute errors and least absolute values [26]. In contrast to OLS, LAD has the benefit of being robust to outliers. LAD is ideal for our setting, since its choice of minimising ℓ₁ error corresponds to maximising the exponential of the negative ℓ₁ norm: a Laplace noise model, akin to maximum-likelihood estimation, matching DiffPriv precisely.
Definition 4.
Let h be the Euler histogram counts with a set of defined constraints C. Given noisy histogram counts h̃, constrained LAD inference returns the vector x that satisfies the constraints while minimising ‖x − h̃‖₁.
Consistency. We define three constraints C1, C2 and C3 for Euler histograms as follows. Our consistency constraints consider the relationships between face, edge and vertex counts. Every increment to an edge count must correspond to an increment to the counts of both incident faces as well; and similarly for an increment to a vertex count, the corresponding four incident edge counts must be incremented. Finally, query regions should respond with non-negative count estimates. These represent the intuition behind our three sets of consistency constraints. For ease of exposition, we refer to face, edge and vertex components of x by f, e and v respectively. The meaning will be apparent from context.
Constraint 1.
Every edge count is less than or equal to the minimum value of its two incident faces.
Constraint 2.
Every vertex count is less than or equal to its four incident edges’ counts.
Constraint 3.
The response to every range query, computed via (1), is non-negative.
Algorithm. We consider two constrained inference programs for enforcing these constraints. Both minimise the change to the histogram counts subject to the constraints. The first, LAD, minimises the change to counts with respect to the ℓ₁ norm: min ‖x − h̃‖₁ subject to C1–C3.
By introducing a primal variable z_i per histogram cell count, we can transform this to the following linear program:

min Σ_{i∈I} z_i   (2)
s.t. −z_i ≤ x_i − h̃_i ≤ z_i for all i ∈ I, and x satisfies C1–C3.
Alternatively, we could adopt the ℓ∞ norm for minimising the change to the histogram cell counts: min ‖x − h̃‖∞ subject to C1–C3. And again we may transform this program to an equivalent LP, this time introducing only a single new primal variable z:

min z   (3)
s.t. −z ≤ x_i − h̃_i ≤ z for all i ∈ I, and x satisfies C1–C3.
We analyse Program (3); however, we recommend that in practice Program (2) be used, since it is better able to minimise the change to all cell counts, while Program (3) only minimises the maximum error. Algorithm 3 and our experiments reflect this recommendation.
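To illustrate the shape of Program (2), here is a minimal sketch (our own toy instance, not the paper's implementation) on a 1 × 2 grid—two faces f1, f2 and the edge e between them—solved with `scipy.optimize.linprog`. The variable order [f1, f2, e, z1, z2, z3] and the toy noisy counts are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

h_noisy = np.array([3.2, 2.1, 4.0])     # noisy counts: the edge exceeds a face
n = len(h_noisy)
c = np.concatenate([np.zeros(n), np.ones(n)])   # minimise sum of slack z_i

I = np.eye(n)
A_ub = np.block([[ I, -I],              #  x - z <= h_noisy
                 [-I, -I]])             # -x - z <= -h_noisy
b_ub = np.concatenate([h_noisy, -h_noisy])

# Consistency C1: x_e - x_f1 <= 0 and x_e - x_f2 <= 0
cons = np.zeros((2, 2 * n))
cons[0, 2], cons[0, 0] = 1.0, -1.0
cons[1, 2], cons[1, 1] = 1.0, -1.0
A_ub = np.vstack([A_ub, cons])
b_ub = np.concatenate([b_ub, np.zeros(2)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * n))
x = res.x[:n]
assert x[2] <= min(x[0], x[1]) + 1e-6   # C1 now holds
print(np.round(x, 3))                    # optimal objective is 1.9 (not unique in x)
```

Multiple count vectors attain the optimum here (the edge can settle anywhere between the two face estimates), which is why the assertion checks the constraint rather than a specific solution.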
Privacy.
Since LinProg depends only on the output of DiffPriv, it preserves the same level of differential privacy.
Utility. We can establish high-probability utility bounds on LinProg that take a similar form to those proved for DiffPriv, but via different arguments.
Theorem 2.
For any confidence level ρ ∈ (0, 1), for true counts h output by Euler, noisy counts h̃ output by DiffPriv, and x minimising Program (3), we have max_i |x_i − h_i| ≤ 2(Δ/ε) log(|I|/ρ) with probability at least 1 − ρ.
Proof:
We reduce to the bound on DiffPriv by noting that, since LinProg minimises ℓ∞ distance and the true counts h are feasible, the distance from x to h̃ must be no more than that from h to h̃. In other words,
max_i |x_i − h_i| ≤ max_i |x_i − h̃_i| + max_i |h̃_i − h_i| ≤ 2 max_i |h̃_i − h_i| ≤ 2(Δ/ε) log(|I|/ρ),
with the final bound holding w.p. at least 1 − ρ, by Theorem 1. ∎
Computational Complexity. Linear programming interior-point methods—also referred to as barrier algorithms—are polynomial-time, with worst-case complexity O(n^3.5) [27], for n the number of variables. Therefore, for Euler histograms with n = O(m²) variables the worst-case time complexity is O(m^7), but in practice LinProg is efficient as demonstrated in our runtime experiments (cf. Sec. VI-H).
V-D Algorithm: Rounding
After running LinProg, we introduce covertness via Round, which rounds each count to the nearest integer. This allows the curator to hide that the data has been perturbed.
Privacy.
Since Round depends only on differentiallyprivate data, it also preserves differential privacy.
Utility. The analysis of utility for Round is more straightforward than for DiffPriv and LinProg.
Lemma 2.
If x is the output histogram of LinProg and x̂ is the result of Round, then max_i |x̂_i − x_i| ≤ 1/2.
Lemma 3.
Round is consistent when run after LinProg, and so it is also covert.
Proof:
We only need to check whether Round violates any of the consistency constraints. This cannot happen, since nearest-integer rounding is monotone: whenever the smaller side of a constraint inequality rounds up, the larger side must round up too, and whenever the larger side rounds down, the smaller side must do the same. Therefore, consistency is invariant to rounding. ∎
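The crux of the argument is that rounding preserves order, so any inequality constraint LinProg established survives Round; a quick check (hypothetical values, any consistent tie-breaking rule works) confirms this:

```python
# Monotonicity of nearest-integer rounding: a <= b implies round(a) <= round(b),
# so consistency constraints such as e <= min(f1, f2) cannot break under Round.
pairs = [(0.4, 0.6), (1.5, 1.5), (2.49, 2.51), (3.0, 7.2), (0.2, 5.5)]
for a, b in pairs:
    assert a <= b and round(a) <= round(b)
print("consistency preserved under rounding")
```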
Computational Complexity. Similar to DiffPriv, Round's time and space complexities are an efficient O(m²).
V-E Full Theoretical Analysis
We are now able to combine the individual utility analyses of the four stages of our approach, into an overall highprobability bound on utility.
Corollary 4.
For confidence level ρ ∈ (0, 1), and histogram counts h, ĥ output by Euler and Round respectively, we have that max_i |ĥ_i − h_i| ≤ 2(Δ/ε) log(|I|/ρ) + 1/2 holds with probability at least 1 − ρ.
Proof:
Combining Theorem 2 with Lemma 2 via the triangle inequality yields the result. ∎ Note that the utility bound's error is O((Δ/ε) log(|I|/ρ)) w.h.p.
Remark 1.
In order to achieve appropriate utility, we recommend selecting cell size d based on third-party requirements: the smallest QR that a third party might run on an area is a reasonable choice for d. B can naturally be set by users or the service provider. There is little risk that B would be made too large, as a user cannot have a very large region representing their regular location in a short time interval. In fitness applications, for example, users can delimit the area in which they usually work out.
VI Experimental Study
VI-A Datasets
We conduct extensive experiments on three real-world datasets that vary in terms of density and concentration of locations. One dataset records GPS coordinates of more than 500 taxis over 30 days in the San Francisco Bay Area; these cab mobility traces are provided through the cabspotting project [28]. Here, cabs' GPS points are concentrated on the financial district and surrounding areas, and we select this area for the empirical study (cf. Fig. 8). Our remaining datasets are from Beijing: the Geolife project Version 1.3 (Microsoft Research Asia) [29], as well as T-Drive [30]. In GeoLife 1.3, GPS trajectories were collected from 182 users, comprising 18,000 trajectories; 91.5 percent of the trajectories are logged in a dense representation (a point every 1–5 seconds or every 5–10 meters). The GeoLife dataset gathers a broad range of users' outdoor movements, including not only everyday routines, e.g., going home and commuting to work, but also entertainment and sporting activities such as shopping, sightseeing, dining, hiking, and cycling. T-Drive includes the GPS trajectories of about 10,000 taxis within Beijing, with a total of about 15 million points. Compared to GeoLife, T-Drive has a relatively better distribution of users' spatial regions in a partitioned space.
VI-B Preprocessing
We preprocess each dataset to extract convex planar bodies, representing regions that users most frequent. This simulates a real application where extraction might be conducted at the end point, e.g., in a fitness tracker where users can set their workout area.

Fit a kernel density estimate (KDE) and then take the mode of each user's set of GPS points;

Take the k-nearest-neighbour (kNN) points to the mode, with k chosen per dataset (e.g., for GeoLife, k corresponds to 8 hours of logging). If the number of GPS points is less than k, we take all points;

Check that all the points are within the defined diameter B, discarding any outliers; and

Compute the convex hull of remaining points to create a convex planar body representing an area of frequent visitation.
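The steps above can be sketched as follows; this is our own illustration under assumed parameters (k, B), with the mode approximated by the densest sample point, and the function name hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial import ConvexHull

def extract_body(points, k=50, B=2.0):
    """Extract a convex planar body of diameter <= B from a user's GPS points."""
    pts = np.asarray(points, dtype=float)
    kde = gaussian_kde(pts.T)                  # step 1: KDE over the 2-D points
    mode = pts[np.argmax(kde(pts.T))]          # densest sample approximates the mode
    order = np.argsort(np.linalg.norm(pts - mode, axis=1))
    nearest = pts[order[:min(k, len(pts))]]    # step 2: k-NN of the mode
    # step 3: keep points within B/2 of the mode, so pairwise distances <= B
    kept = nearest[np.linalg.norm(nearest - mode, axis=1) <= B / 2]
    hull = ConvexHull(kept)                    # step 4: convex hull of survivors
    return kept[hull.vertices]                 # vertices of the convex body

rng = np.random.default_rng(0)
gps = rng.normal([5.0, 5.0], 0.3, size=(200, 2))   # synthetic GPS cloud
body = extract_body(gps)
```

Enforcing the B/2 radius around the mode is one simple way to satisfy the diameter check of step 3, since any two surviving points are then at most B apart.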
Fig. 8(a) demonstrates the trajectory of a cab in San Francisco, taken from the Cabspotting project. In Fig. 8(b), the level sets within the contour lines are convex, and we could have picked these for our convex planar body; but in general level sets are not convex, so our approach generates a convex approximation. As depicted in Fig. 8(c), cab GPS points in this dataset are dense and concentrated in a specific area. Fig. 8(d) illustrates the extracted convex body.
After preprocessing, we create histogram counts for each convex body, to construct the Euler histograms as our baseline approach and as the basis for our other algorithms.
VI-C Parameter Settings
Initial settings for Beijing, with the four parameters A (area side length), d (cell size), B (bounded diameter) and ε, are 20km, 1km, 2km and 1 respectively. These settings are applied to the T-Drive and GeoLife 1.3 datasets. For San Francisco's Cabspotting dataset, the area size is 3.2km × 3.2km and the cell size is 0.8km, but the remaining parameters are the same (cf. Table I).
Even though the literature on point data [14, 15] tends to use only specific QR sizes, we vary the QR size over the entire range of the area to more fully evaluate our technique. For experiments where we compare histograms, the A/d ratio, which defines the number of grid cells along each axis, has been kept constant for all datasets (cf. Sec. VI-H).
VI-D Evaluation Metrics
Apart from the varying parameter, we keep all other parameters fixed and compute the median relative error as an empirical measure of utility, as is standard [14, 15]. We repeat each experiment 100 times and compute the median relative error. The baseline approach is Euler, as it provides exact answers. The privacy-preserving algorithms DiffPriv, LinProg and Round are compared to Euler. Furthermore, we compute the running time of each algorithm (cf. Sec. VI-H).
Table I: Experimental parameter settings; the final column is the privacy level ε.

Dataset     | Cell Size (d) | B   | Area Size (A) | A/d        | QR Size/Shape | ε
T-Drive     | 1km           | 2km | 20km×20km     | 20         | 1–10%         | 1
T-Drive     | 1km           | 2km | 20km×20km     | 20         | 10–100%       | 1
T-Drive     | 0.66, 1, 2km  | 2km | 20km×20km     | 30, 20, 10 | 1%            | 1
T-Drive     | 2km           | 2km | 20km×20km     | 10         | 1%            | 0.1, 0.4, 0.7, 1
GeoLife 1.3 | 1km           | 2km | 20km×20km     | 20         | 1–10%         | 1
GeoLife 1.3 | 1km           | 2km | 20km×20km     | 20         | 10–100%       | 1
Cabspotting | 0.8km         | 2km | 3.2km×3.2km   | 4          | 10–100%       | 1
VI-E Varying Query Rectangle Size
In this section we compute the median relative error on all datasets, representing diversity in terms of sparsity, density and concentration, to demonstrate the effect on accuracy. We fix every parameter except QR size, running range queries of various sizes and positions on the partitioned map, based on the definition of a QR as a union of grid cells. Range queries are varied from 1 to 10 and 10 to 100 percent of the total area size of the respective city. The results for various sizes as well as shapes of a range query are shown in Figs. 13–21. Various factors can affect the response to a QR, including its shape and size, whether convex bodies are sparse or dense in the space, and whether they are concentrated. Furthermore, the computed global sensitivity (see Lemma 1) differs across dataset settings: Δ = 25 for both T-Drive and GeoLife, and Δ = 49 for Cabspotting; this value also affects the results. The similarity between T-Drive and Cabspotting is that both record taxi driver movements; a difference is that the former is not concentrated on a specific area while the latter is. In GeoLife 1.3 the convex bodies are more dense, there being a large number of trajectories.
As depicted in Fig. 13 for the T-Drive dataset, since the data is more evenly distributed the error is very low for larger QR sizes (Fig. 13(b)), lower than for smaller QRs (Fig. 13(a)). A variety of QR shapes for the smaller sizes (Fig. 13(c)) and the larger ones (Fig. 13(d)) are depicted accordingly. For instance, a 1% QR on the partitioned map of Beijing can take several row-by-column geometries, where the first number denotes the number of rows and the second the number of columns. Compared to GeoLife 1.3 (Fig. 21), since trajectories are more focused on certain areas, the error increases as QR size decreases (Fig. 21(a)). With regard to the Cabspotting dataset (Fig. 21), some parts of the selected area are sparser, which affects the result of DiffPriv for QR sizes of 50% and 60%, as they contain both dense and sparse cells. For larger QRs, however, errors cancel each other out due to the Euler formula (1). In all cases, LinProg and Round reduce the errors and provide a high level of accuracy. Since the number of spatial partitions for the chosen area is smaller than for the other datasets, only QR sizes and shapes between 10% and 100% are shown in panels (a) and (b); the QR errors for the smaller sizes, 1%–9%, are below 10%.
LinProg and Round provide similar results, and as discussed in Sec. V, the difference is the covertness property of Round.
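The error cancellation attributed to the Euler formula rests on the counting structure underlying Euler: each body contributes face, edge, and vertex counts whose alternating sum has Euler characteristic 1 within any query rectangle, so a body straddling many cells is still counted once. The following is our own minimal sketch for axis-aligned rectangular bodies given in cell coordinates; the class and method names are ours, and the paper's structure handles general convex bodies.

```python
import numpy as np

class EulerHistogram:
    """Minimal Euler histogram over an R x C grid (a sketch of the
    exact counting structure, without the privacy mechanism)."""
    def __init__(self, R, C):
        self.F = np.zeros((R, C))          # face (cell) counts
        self.Eh = np.zeros((R - 1, C))     # edges shared by vertically adjacent cells
        self.Ev = np.zeros((R, C - 1))     # edges shared by horizontally adjacent cells
        self.V = np.zeros((R - 1, C - 1))  # interior vertex counts

    def insert(self, r0, r1, c0, c1):
        # Body covers cells [r0..r1] x [c0..c1] inclusive.
        self.F[r0:r1 + 1, c0:c1 + 1] += 1
        self.Eh[r0:r1, c0:c1 + 1] += 1
        self.Ev[r0:r1 + 1, c0:c1] += 1
        self.V[r0:r1, c0:c1] += 1

    def query(self, r0, r1, c0, c1):
        # F - E + V of each body's clipped intersection equals 1, so
        # a body straddling several cells is counted exactly once.
        f = self.F[r0:r1 + 1, c0:c1 + 1].sum()
        e = self.Eh[r0:r1, c0:c1 + 1].sum() + self.Ev[r0:r1 + 1, c0:c1].sum()
        v = self.V[r0:r1, c0:c1].sum()
        return f - e + v

h = EulerHistogram(4, 4)
h.insert(0, 1, 0, 1)             # one body straddling four cells
total = h.query(0, 3, 0, 3)      # counted once, not four times
```

A naive sum over cell counts would return 4 for this query; the alternating Euler sum returns 1.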
VI-F Varying Area Size/Grid Cell Size Ratio
We vary the ratio of area size (A) to grid cell size (d) and compute the median relative error for a QR of 1% of the total area of the T-Drive dataset. The area size for this dataset is 20km × 20km. By increasing the cell size, we expect accuracy to improve, as demonstrated in Fig. 23. We fix the QR at 1% and vary the grid cell size over 0.66km, 1km, and 2km, yielding ratios of 30, 20, and 10 respectively. As shown, accuracy increases with grid cell size. As illustrated in Fig. 23, decreasing the grid cell size increases the error due to the higher global sensitivity (GS) of smaller cells: the GS value grows as the cell size shrinks from 2km to 1km to 0.66km. If we wish to decrease the cell size without incurring reduced accuracy, our theoretical results suggest that we should decrease the other parameters accordingly.
VI-G Varying Privacy Parameter
We apply a similar procedure to vary the privacy parameter ε across values 0.1, 0.4, 0.7, and 1, with the QR fixed at 1% of the total area (20km × 20km) and the cell size at 2km. The effect of increasing ε on accuracy is depicted in Fig. 23. Decreasing ε from 1 to 0.1 increases the scale parameter of the Laplace distribution (the noise added to the counts) by a factor of 10, since the scale is proportional to 1/ε, and this reduces the accuracy of the result. To keep accuracy relatively constant when reducing ε, the third party can vary the other parameters.
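The relation between ε and the injected noise follows directly from the Laplace mechanism, whose scale is global sensitivity divided by ε. A minimal sketch, where the GS value is purely illustrative and not a figure from the paper:

```python
# Laplace mechanism: noise scale b = GS / epsilon, where GS is the
# global sensitivity of the released counts (value below is
# illustrative, not the paper's computed sensitivity).
GS = 4.0

def laplace_scale(gs, epsilon):
    """Scale parameter of the Laplace noise for a given privacy budget."""
    return gs / epsilon

# The epsilon grid used in the experiments.
scales = {eps: laplace_scale(GS, eps) for eps in (0.1, 0.4, 0.7, 1.0)}

# Shrinking epsilon from 1 to 0.1 inflates the noise scale tenfold.
ratio = scales[0.1] / scales[1.0]
```

Because relative error grows with the noise scale, the tenfold inflation at ε = 0.1 directly explains the accuracy drop observed in Fig. 23.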
VI-H Running Time
Fig. 24 shows running times for all datasets of various sizes. As discussed in Sec. VI-C, we kept the ratio fixed. The running times are approximately similar across datasets for each technique. The y-axis is in seconds (log scale); for the largest dataset, GeoLife 1.3, the total running time is 196 seconds. DiffPriv, LinProg, and Round take less than 1 second on all datasets. Each of our algorithms is eminently practical to implement and run.
VII Concluding Remarks
For the first time, we propose a non-interactive differentially-private approach to counting planar bodies representative of users' spatial regions, e.g., a workout area, areas of customer preference for hotel bookings, or locations of frequent visitation for facility planning.
The key insight of our approach is to leverage Euler histograms for accurate counting, cell perturbations for differential privacy, and constrained-inference smoothing to reinstate consistency. Constrained inference often improves utility by cancelling noisy perturbations. Our formulation of constrained inference is a novel constrained application of the robust method of least absolute deviations. Unlike existing constrained inference based on ordinal regression, our formulation precisely matches our privacy-preserving cell perturbation distribution. By optimising for consistency while rounding cell counts, we achieve a covertness property for our counting mechanism: third parties cannot determine that we have perturbed data in the first place.
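A least-absolute-deviations fit can be cast as a linear program by introducing one auxiliary variable per residual. The sketch below is our own simplified formulation: it imposes only nonnegativity (and an optional upper bound) as a stand-in for the paper's full consistency constraints, and the function name and solver choice are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def lad_denoise(noisy, upper=None):
    """Fit consistent counts x to noisy counts y by minimising
    sum |x_i - y_i| subject to x >= 0 (and an optional upper bound),
    a simplified stand-in for the paper's consistency constraints."""
    y = np.asarray(noisy, dtype=float)
    n = y.size
    # Variables: [x (n), t (n)] with t_i >= |x_i - y_i|.
    c = np.concatenate([np.zeros(n), np.ones(n)])  # minimise sum of t
    I = np.eye(n)
    # x_i - t_i <= y_i   and   -x_i - t_i <= -y_i   encode t_i >= |x_i - y_i|.
    A_ub = np.block([[I, -I], [-I, -I]])
    b_ub = np.concatenate([y, -y])
    bounds = [(0, upper)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]

# Negative perturbed counts are pulled back to the feasible region;
# counts already consistent are left untouched.
x = lad_denoise([5.2, -0.7, 3.0])
```

The L1 objective is what makes the smoothing robust: a single heavily perturbed cell moves only its own fitted value, rather than dragging neighbouring counts with it as a least-squares fit would.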
A full theoretical analysis of utility and differential privacy is complemented by experimental results on three datasets.
Acknowledgement
This work was supported in part by Australian Research Council DECRA grant DE160100584.
References
 [1] G. Ghinita, Privacy for Location-based Services, ser. Synthesis Lectures on Information Security, Privacy, and Trust. Morgan & Claypool, 2013.
 [2] J. Krumm, “Inference attacks on location tracks,” in Pervasive Computing, PERVASIVE’07, 2007, pp. 127–143.
 [3] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao, “Efficient OLAP operations in spatial data warehouses,” in SSTD’01, 2001, pp. 443–459.
 [4] Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias, “Spatiotemporal aggregation using sketches,” in ICDE’04, 2004, pp. 214–225.
 [5] I. F. V. López, R. T. Snodgrass, and B. Moon, “Spatiotemporal aggregate computation: a survey,” IEEE Trans. KDE, vol. 17, no. 2, pp. 271–286, 2005.
 [6] F. Braz, S. Orlando, R. Orsini, A. Raffaetà, A. Roncato, and C. Silvestri, “Approximate aggregations in trajectory data warehouses,” in ICDE’07, 2007, pp. 536–545.
 [7] L. Leonardi, S. Orlando, A. Raffaetà, A. Roncato, C. Silvestri, G. L. Andrienko, and N. V. Andrienko, “A general framework for trajectory data warehousing and visual OLAP,” GeoInfo., vol. 18, no. 2, pp. 273–312, 2014.
 [8] R. Trudeau, Introduction to Graph Theory. Dover, 1993.
 [9] R. Beigel and E. Tanin, “The geometry of browsing,” in LATIN ’98, 1998, pp. 331–340.
 [10] C. Sun, D. Agrawal, and A. El Abbadi, “Selectivity estimation for spatial joins with geometric selections,” in EDBT’02, 2002, pp. 609–626.
 [11] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in TCC’06, 2006, pp. 265–284.
 [12] A. Inan, M. Kantarcioglu, G. Ghinita, and E. Bertino, “Private record matching using differential privacy,” in EDBT’10, 2010, pp. 123–134.
 [13] R. Chen, B. C. M. Fung, B. C. Desai, and N. M. Sossou, “Differentially private transit data publication: a case study on the Montreal transportation system,” in KDD’12, 2012, pp. 213–221.
 [14] G. Cormode, C. M. Procopiuc, D. Srivastava, E. Shen, and T. Yu, “Differentially private spatial decompositions,” in ICDE’12, 2012, pp. 20–31.
 [15] W. H. Qardaji, W. Yang, and N. Li, “Differentially private grids for geospatial data,” in ICDE’13, 2013, pp. 757–768.
 [16] X. He, G. Cormode, A. Machanavajjhala, C. M. Procopiuc, and D. Srivastava, “DPT: differentially private trajectory synthesis using hierarchical reference systems,” PVLDB, vol. 8, no. 11, pp. 1154–1165, 2015.
 [17] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar, “Privacy, accuracy, and consistency too: a holistic solution to contingency table release,” in PODS’07, 2007, pp. 273–282.
 [18] H. Xie, E. Tanin, and L. Kulik, “Distributed histograms for processing aggregate data from moving objects,” in MDM’07, 2007, pp. 152–157.
 [19] M. Fanaeepour, L. Kulik, E. Tanin, and B. I. P. Rubinstein, “The CASE histogram: privacy-aware processing of trajectory data using aggregates,” GeoInformatica, pp. 1–52, 2015.
 [20] C. Dwork, “Differential privacy: A survey of results,” in TAMC’08, 2008, pp. 1–19.
 [21] H. To, G. Ghinita, and C. Shahabi, “A framework for protecting worker location privacy in spatial crowdsourcing,” PVLDB, vol. 7, no. 10, pp. 919–930, 2014.
 [22] B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft, “Learning in a large function space: Privacy-preserving mechanisms for SVM learning,” J. Privacy and Confidentiality, vol. 4, no. 1, pp. 65–100, 2012.
 [23] D. J. Mir, S. Isaacman, R. Cáceres, M. Martonosi, and R. N. Wright, “DP-WHERE: differentially private modeling of human mobility,” in BigData’13, 2013, pp. 580–588.
 [24] C. Li, M. Hay, G. Miklau, and Y. Wang, “A data- and workload-aware query answering algorithm for range queries under differential privacy,” PVLDB, vol. 7, no. 5, pp. 341–352, 2014.
 [25] M. Hay, V. Rastogi, G. Miklau, and D. Suciu, “Boosting the accuracy of differentially private histograms through consistency,” PVLDB, vol. 3, no. 1, pp. 1021–1032, 2010.
 [26] T. E. Dielman, “Least absolute value regression: recent contributions,” J. Stat. Computation and Simulation, vol. 75, no. 4, pp. 263–286, 2005.
 [27] N. Karmarkar, “A new polynomial-time algorithm for linear programming,” in STOC’84, 1984, pp. 302–311.
 [28] M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser, “A Parsimonious Model of Mobile Partitioned Networks with Clustering,” in COMSNETS, 2009. [Online]. Available: http://www.comsnets.org
 [29] Y. Zheng, X. Xie, and W. Ma, “Geolife: A collaborative social networking service among user, location and trajectory,” IEEE Data Eng. Bull., vol. 33, no. 2, pp. 32–39, 2010.
 [30] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang, “T-drive: driving directions based on taxi trajectories,” in ACM GIS’10, 2010, pp. 99–108.