Beyond Points and Paths: Counting Private Bodies

Maryam Fanaeepour¹,² and Benjamin I. P. Rubinstein¹
¹Department of Computing and Information Systems, The University of Melbourne, Australia
²Data61, CSIRO, Australia
Email: maryamf@student.unimelb.edu.au, brubinstein@unimelb.edu.au
Abstract

Mining of spatial data is an enabling technology for mobile services, Internet-connected cars, and the Internet of Things. But the very distinctiveness of spatial data that drives utility comes at the cost of user privacy. In this work, we continue the tradition of privacy-preserving spatial analytics, focusing not on point or path data, but on planar spatial regions. Such data represents the area of a user’s most frequent visitation—such as “around home and nearby shops”. Specifically, we consider the differentially-private release of data structures that support range queries for counting users’ spatial regions. Counting planar regions leads to unique challenges not faced in existing work. A user’s spatial region that straddles multiple data structure cells can lead to duplicate counting at query time. We provably avoid this pitfall by leveraging the Euler characteristic. To address the increased sensitivity of range queries to spatial region data, we calibrate privacy-preserving noise using bounded user region size and a constrained inference that uses robust least absolute deviations. Our novel constrained inference reduces noise and introduces covertness by (privately) imposing consistency. We provide a full end-to-end theoretical analysis of both differential privacy and high-probability utility for our approach using concentration bounds. A comprehensive experimental study on several real-world datasets establishes practical validity.

I Introduction

The ubiquity, quality and usability of location-based services support the ready availability of user tracking. Location data sharing is used across a wide range of applications such as traffic monitoring, facility location planning, recommendation systems and contextual advertising. The distinctiveness of location data, however, has led to calls for location privacy [1, 2]: the ability to track users in aggregate without breaching individual privacy.

Typical private spatial analytics supports point locations. Points and trajectories, however, do not best represent user location in all applications. In facility-services planning, a planner may wish to locate a new department store in a location that overlaps with users’ regions of frequent visitation, while hotel-booking sites collect area-level information about customers’ preferred destinations. Such problems motivate our focus on counting private planar bodies (we use body and region interchangeably to refer to a user’s spatial area). Given a collection of privacy-sensitive planar bodies representing regions of frequent location, we wish to support counting range queries while preserving individual privacy. Fig. 1 illustrates this task on a map of metropolitan Melbourne, with planar bodies representing regions of individual users’ frequent visitation. Third parties may wish to submit any number of queries requesting the number of users’ areas falling in a specified query region, e.g., for urban transport planning or retail analytics.

A leading approach for responding to range queries in spatial data analytics is aggregation [3, 4, 5, 6, 7]. Initial interest in aggregation was due to computational efficiency. In the setting of planar bodies, conventional grid-partitioned histograms cannot provide accurate results due to the duplicate counting problem (in the literature, the terms multiple, double or distinct counting are used interchangeably; we suggest “duplicate” as it conveys that objects are over-counted), as a planar body may span more than one histogram cell simultaneously. This is a problem unique to counting planar bodies. To address this challenge, we leverage the Euler characteristic [8], where face, edge and vertex counts are stored separately. Such Euler histograms [9] permit exact counting of convex planar bodies [10].

Fig. 1: Example users’ spatial regions on a map of Melbourne.

The recently emerged strong guarantee of differential privacy [11] has attracted a number of researchers in location privacy. Typically such work studies aggregation of point and trajectory data [12, 13, 14, 15, 16], often via histogram-like data structures (regular or hierarchical) for controlling the level of perturbation required for privacy.

Our goal in this paper is to address the accurate counting of planar bodies, while providing the strong guarantee of differential privacy. While Euler histograms provide an excellent starting point in terms of utility, computational efficiency and aggregation-based qualitative privacy, a service provider may be directed by users to provide strong semantic privacy. Differential privacy guarantees that an attacker with significant prior knowledge and computational resources cannot determine presence or absence of a user in a set of planar bodies.

The challenge in combining the ideas of Euler histograms and differential privacy is that the data structure’s large number of counts require randomised perturbation. As a result, the total noise added could be prohibitively high. Compared to point data in which at most one cell is impacted per record, here an object could span more than one cell, impacting many counts. Naive solutions would therefore significantly degrade utility. Moreover when sampled independently, perturbations can destroy the consistency of query responses over the resulting structure [17].

The first stage of our approach is to perturb the counts of a Euler histogram by applying noise calibrated, via sensitivity, to a natural bound on planar body size. Then, to re-instate consistency and improve utility at no cost to privacy, we apply constrained inference that seeks to minimally update counts to satisfy consistency constraints. These constraints reflect relationships between data structure counts that must hold, but may be violated by perturbation. Under these constraints we apply least absolute deviations (LAD), which is more robust to outliers than the ordinary least squares used previously for constrained inference under differential privacy. By enforcing consistency, we also “average out” previously-added noise, thereby improving utility in certain cases. Finally, we round counts so that query responses are integral. This final stage, combined with consistency, yields responses that preserve a covertness property: third party observers cannot determine that privacy-preserving perturbation has taken place.

Our focus is on the non-interactive privacy setting, wherein our mechanisms release privacy-preserving data structures to third parties, with no limitation on the number of subsequent query responses permitted.

Contributions. We deliver several main contributions:

  • For the first time, we address the differentially-private counting of planar bodies in the non-interactive setting;

  • We propose differentially-private mechanisms that leverage the Euler characteristic (via the Euler histogram data structure) to address the duplicate counting problem;

  • We formulate novel constrained inference to reduce noise and introduce consistency based on the robust method of least absolute deviations; combined with rounding, this guarantees a covertness property;

  • We contribute an end-to-end theoretical analysis of both high-probability utility and differential privacy; and

  • We conduct a comprehensive experimental study on real-world datasets, which confirms the suitability of our approach to private range queries on spatial bodies.

II Related Work

Aggregation under range queries has emerged as a fundamental primitive in spatial analytics [3, 4, 5, 6, 7]. Originally motivated by statistical and computational efficiency, aggregation is now also used for qualitative privacy.

A key challenge in aggregation is the distinct counting [3, 4, 5, 6, 7] or multiple-count problem [10]. In contrast to point objects, a spatial body can span more than one cell in a partitioned space, inhibiting the ability of regular histograms to form accurate counts. Euler histograms [9] are designed to address this problem for convex bodies [10], by appealing to Euler’s formula from graph theory [8]. A variation of the Euler histogram has been studied for trajectory data to address aggregate queries on moving objects [18]. In that work, Euler histograms were used in a distributed setting (motivating a distributed Euler histogram) to tackle the duplicate (distinct) entry problem rather than duplicate (distinct) counting. In a related line of work [19], the CASE histogram was proposed as a privacy-aware approach for trajectory data analytics, in which only count data over a partitioned space is used and the Euler characteristic is applied to address duplicate counting. The authors in [19] discuss the interactive setting for differentially private Euler histogram release, which has the prohibitive limitation that the number of queries permitted is linear in the number of bodies. Our work has no such limitation (see [20]).

Differential privacy [11, 20] has now become a preferred approach to data sanitisation as it provides a strong semantic guarantee with minimal assumptions placed on the adversary’s knowledge or capabilities. Due to its popularity, differential privacy has been applied to many algorithms and across many domains, such as specialized versions of spatial data indexing structures designed with differential privacy for the purpose of private record matching [12]; in spatial crowdsourcing to help volunteer workers’ locations remain private [21]; in machine learning, releasing differentially-private learned models of SVM classifiers [22]; and for modelling human mobility from real-world cellular network data [23].

Within the scope of aggregation, studies in the area of point privacy have also proposed sanitisation algorithms for generating differentially private histograms and releasing aggregate statistics. Many studies have looked at differential privacy of point sets [12, 14, 13, 16, 15, 24], studying both regular grid-partitioned data structures and hierarchical structures. Our work addresses, for the first time, the problem of differentially-private counting of planar bodies.

III Preliminaries

III-A Euler Histograms

One natural but qualitative approach to privacy preservation is spatial aggregation. We will leverage a data structure that permits spatial aggregation for body counts. Given a grid partitioned space, an Euler histogram data structure allocates buckets not only for grid cells, but also for grid cell edges and vertices. We formally define the data structure as below.

Definition 1.

Consider an arbitrary partition of a subset of $\mathbb{R}^2$ into convex cells. Define $I_F$, $I_E$, $I_V$ to be index sets over the partition’s faces, edges (face intersections), and vertices (edge intersections). Let $\boldsymbol{a}$ be a vector with components the faces, edges and vertices indexed by $I = I_F \cup I_E \cup I_V$ (i.e., each $a_i$ represents a face/edge/vertex area of the Euclidean plane); and let the vector of non-negative integers $\boldsymbol{c}$ be indexed by $I$ as well (representing counts per face/edge/vertex). Then we call the data structure $(\boldsymbol{a}, \boldsymbol{c})$ an Euler histogram.

Fig. 2: Two convex bodies overlapping a spatial partition and the corresponding Euler histogram counts; an example query region (QR) is highlighted.

For example, an Euler histogram could be defined over a Voronoi partition defined by a finite set of sensors [18]; or a rectangular partition over an urban area [19] such as in Fig. 2.

Beigel and Tanin [9] first brought to spatial databases the observation that the Euler characteristic [8] (including its extensions to higher dimensions) directly applies to this data structure. Euler’s characteristic states that the number of convex bodies overlapping certain query regions can be computed exactly as

$$N \;=\; F - E + V \ , \qquad (1)$$

where $F$, $E$, $V$ are the sums of face, edge, and vertex counts in $\boldsymbol{c}$ within the given query region (QR in Fig. 2). Duplicate counting due to summing face counts is corrected by subtracting edge counts. This in turn can over-compensate, and is corrected by adding vertex counts. This is a special case of the Inclusion-Exclusion Principle of set theory and applied probability. Fig. 2 illustrates the impact two planar bodies have on a square-partition Euler histogram. Compared to conventional histograms, with the use of extra counts for grid cell edges and vertices, large objects spanning more than one cell are now distinguishable from several small objects intersecting only one cell. Applying (1) to calculate the number of objects inside the highlighted QR of Fig. 2, we arrive at the correct count.
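To make the bookkeeping concrete, the following is a minimal sketch (ours, not code from the paper) of answering a cell-aligned range query via (1) on a square grid. The array layout, and the convention that only components interior to the query region are summed, are illustrative assumptions.

```python
import numpy as np

def euler_query(faces, h_edges, v_edges, verts, r0, r1, c0, c1):
    """Count bodies overlapping the union of cells [r0, r1) x [c0, c1).

    faces:   (m, m)     per-cell counts
    h_edges: (m+1, m)   horizontal edges (h_edges[r, c] lies between rows r-1 and r)
    v_edges: (m, m+1)   vertical edges (v_edges[r, c] lies between columns c-1 and c)
    verts:   (m+1, m+1) vertex counts
    Only edges and vertices strictly interior to the query region contribute to E, V.
    """
    F = faces[r0:r1, c0:c1].sum()
    E = h_edges[r0 + 1:r1, c0:c1].sum() + v_edges[r0:r1, c0 + 1:c1].sum()
    V = verts[r0 + 1:r1, c0 + 1:c1].sum()
    return F - E + V          # Eq. (1): each convex body is counted exactly once
```

For example, a body spanning two horizontally adjacent cells inside the QR contributes 2 to F, 1 to E (the shared interior edge) and 0 to V, giving F − E + V = 1 as required.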

III-B Differential Privacy

We consider statistical databases of records, each representing a user’s spatial region. Randomisation is vital for preventing an adversary from inverting a released statistic to reconstruct the original (private) data.

Definition 2.

A randomised mechanism $M$ is said to preserve $\epsilon$-differential privacy for $\epsilon > 0$, if for all neighbouring databases $D$, $D'$, which differ in exactly one record, and all measurable response sets $R$:
$$\Pr\!\left(M(D) \in R\right) \;\leq\; e^{\epsilon}\,\Pr\!\left(M(D') \in R\right) .$$

Definition 2 implies that an algorithm is differentially private if a change, addition or deletion of a record, does not significantly affect the output distribution. Differential privacy has become a de facto standard for privacy of input data to statistical databases due to it being a semantic guarantee [11].

IV Problem Statement

The focus of this paper is responding to range queries over spatial datasets consisting of one spatial region per user.

Problem 1.

Given a set of planar bodies, our goal is to batch process them to produce a data structure that can respond to an unlimited number of range queries within some fixed, bounded area: given a query region QR, we are to respond with an approximate count of bodies overlapping that region.

For example, a range query covering the entire area in Fig. 1 might elicit a response of (exact count of) 12.

IV-A Evaluation Metrics

We consider four properties of mechanisms, as competing metrics for evaluating solutions to Problem 1.

  1. Utility: We measure utility by the absolute error of query responses relative to the true count of bodies intersecting a given query region.

  2. Privacy: Mechanisms should achieve non-interactive differential privacy, at some level $\epsilon > 0$, in their release of a data structure on sensitive spatial data.

  3. Consistency: If responses to all possible queries agree with some fixed set of bodies then we say that the mechanism is consistent. Such a set of bodies need not coincide with the original input bodies.

  4. Covertness: If a consistent counting mechanism’s query responses are integer-valued, then we also call it covert.

Utility and privacy are in direct tension, for establishing privacy typically involves reducing the influence of data on responses. However for fixed levels of privacy, for example, we can ask what levels of utility are possible for available solutions to Problem 1.

If privacy-preserving perturbations are made independently across a data structure, it is unsurprising that overlapping queries will not necessarily result in consistent responses. This may be undesirable for some applications that utilise multiple, overlapping queries e.g., urban planning. We consider specific consistency constraints which relate to the data structure adopted. As such, the level of consistency can be benchmarked according to the number of consistency violations suffered. Unlike privacy, consistency is not necessarily at odds with utility: indeed we will demonstrate how imposing consistency can actually improve utility. Intuitively, if privacy-preservation involves injecting independent, random perturbations to a data structure, then consistency corresponds to a smoothness assumption that can be used to ‘cancel out’ the deleterious effect of perturbation. Consistency may also be applied when a measure of ‘stealth’ is desired for a counting mechanism.

IV-B Assumptions

The theoretical guarantees developed in this paper leverage four assumptions (cf. Fig. 3). Each is relatively weak, being well motivated and satisfied in most practical settings.

Fig. 3: A convex body with bounded diameter, on a spatial partition.
  1. We assume that the space partition’s cells are all convex.

  2. We assume that query regions are convex unions of our space partition’s cells.

  3. We assume that all planar bodies are convex.

  4. We assume that all planar bodies are of some bounded diameter $B$.

Our first three assumptions are sufficient for guaranteeing correctness (perfect utility) for Euler histograms. Relaxing these assumptions may come at the cost of utility. For example, for convex query regions that are not unions of cells, we can still exactly count the number of bodies in the (enlarged) union of cells intersecting the QR; and general query regions will still result in excellent utility. Two important partition geometries satisfy these conditions: rectangular and Voronoi partitions.

The fourth assumption controls the $L_1$-Lipschitz smoothness of Euler histogram counts with respect to input bodies. This parameter, also known as the global sensitivity (cf. Definition 3), calibrates the scale of noise added for differential privacy. Our motivating example is regions of frequent visitation, which are necessarily bounded. With $B$ sufficiently large, no restriction is made on valid bodies.

Without loss of generality we assume partitions are square of side length $A$, divided into $m = A/d$ rows and columns, yielding square cells of side length $d$ (cf. Fig. 3).

V Algorithms and Analysis

Our approach consists of four complementary algorithms.

V-A Algorithm: Euler

Algorithm 1 creates a data structure (an Euler histogram, cf. Sec. III-A) to represent aggregated counts of a given set of convex planar bodies $\mathcal{S}$. The algorithm simply increments counts for any face, edge, or vertex that intersects a body.

Input : Set of planar bodies $\mathcal{S}$; partition $\boldsymbol{a}$
Output : Euler histogram $(\boldsymbol{a}, \boldsymbol{c})$
1 for $i \in I$ do
2       $c_i \leftarrow 0$
3 for $s \in \mathcal{S}$ do
4       for $i \in I$ do
5             if $a_i \cap s \neq \emptyset$ then
6                   $c_i \leftarrow c_i + 1$
Algorithm 1 Euler (Eu): Histogram Construction
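As an illustration, here is a minimal, unoptimised Python sketch of Algorithm 1, assuming the shapely library for the geometric intersection tests and the same grid layout as the query sketch above; it is not the authors' implementation.

```python
import numpy as np
from shapely.geometry import box, LineString, Point

def build_euler_histogram(bodies, m, d):
    """bodies: convex shapely geometries (e.g. Polygon); m x m grid of cells of side d."""
    faces = np.zeros((m, m), dtype=int)
    h_edges = np.zeros((m + 1, m), dtype=int)     # horizontal grid edges
    v_edges = np.zeros((m, m + 1), dtype=int)     # vertical grid edges
    verts = np.zeros((m + 1, m + 1), dtype=int)   # grid vertices
    for body in bodies:
        for r in range(m):                        # faces intersected by the body
            for c in range(m):
                if body.intersects(box(c * d, r * d, (c + 1) * d, (r + 1) * d)):
                    faces[r, c] += 1
        for r in range(m + 1):                    # horizontal edges intersected
            for c in range(m):
                if body.intersects(LineString([(c * d, r * d), ((c + 1) * d, r * d)])):
                    h_edges[r, c] += 1
        for r in range(m):                        # vertical edges intersected
            for c in range(m + 1):
                if body.intersects(LineString([(c * d, r * d), (c * d, (r + 1) * d)])):
                    v_edges[r, c] += 1
        for r in range(m + 1):                    # vertices intersected
            for c in range(m + 1):
                if body.intersects(Point(c * d, r * d)):
                    verts[r, c] += 1
    return faces, h_edges, v_edges, verts
```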

Privacy. Euler is qualitatively private via aggregation, but being deterministic it does not achieve differential privacy at any level.

Utility. Assumptions A1–A3 guarantee the preconditions of the following results, which are direct consequences of (1).

Corollary 1.

If input bodies, partition cells, and query region are convex, and the query region is a union of cells, then Euler’s responses to the range query via (1) are accurate.

Corollary 2.

Euler is consistent (P3) and covert (P4).

Computational Complexity. As our partition has $m$ rows and columns, Euler’s time and space complexities are an efficient $O(\lvert \mathcal{S} \rvert\, m^2)$ and $O(m^2)$ respectively.

V-B Algorithm: DiffPriv

Euler achieves a number of our target properties but not differential privacy. We now introduce differential privacy to our approach by perturbing Euler histogram counts. In Algorithm 2, we add carefully-crafted random noise based on the sensitivity of the histogram to input bodies. We truncate any resulting negative counts to zero, improving utility at no cost to privacy.

Privacy. The key step in establishing the differential privacy of DiffPriv is to calculate the Lipschitz smoothness of Euler, which determines the scale of noise to be added.

Definition 3.

Let $f$ be a deterministic, real-vector-valued function of a database. The $L_1$-global sensitivity (GS) of $f$ is given by $\Delta f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1$, taken over all neighbouring pairs of databases $D \sim D'$.

The $L_1$-global sensitivity is a property of the function $f$, independent of the input database. For Euler histograms, the GS measures the effect on the histogram count vector $\boldsymbol{c}$ due to changing an input planar body related to a user’s spatial region.

Lemma 1.

The $L_1$-global sensitivity of Euler is at most $(2\lceil B/d \rceil + 3)^2$, where $d$ is the cell side length, and $B$ is a bound on planar body diameter.

Proof:

By Assumption 4 (cf. Fig. 3), the number of cells that could intersect with a body is at most $k = \lceil B/d \rceil + 1$ in one direction. Therefore the total number of cells that could intersect a body is $k^2$. From this the number of faces, edges and vertices of the partition intersecting a body can be upper-bounded as

#Faces $\leq k^2$
#Edges $\leq 2k(k+1)$
#Vertices $\leq (k+1)^2$

Summing these, we may bound the total number of partition components intersected by the body as
$$k^2 + 2k(k+1) + (k+1)^2 \;=\; (2k+1)^2 \;=\; (2\lceil B/d \rceil + 3)^2 .$$
Since changing a single body in a database can affect each impacted histogram component count by one, this expression is also a bound on global sensitivity. ∎

DiffPriv applies the Laplace mechanism [11] to Euler: it adds to a non-private vector-valued function $f$ i.i.d. Laplace-distributed noise with centre zero and scale $\Delta f / \epsilon$, for desired privacy level $\epsilon > 0$. Here, $f$ is the Euler count vector $\boldsymbol{c}$ and $\Delta f$ is bounded via Lemma 1.

Input : Euler histogram $(\boldsymbol{a}, \boldsymbol{c})$; privacy $\epsilon$; sensitivity $\Delta$
Output : Noisy histogram $(\boldsymbol{a}, \tilde{\boldsymbol{c}})$
1 for $i \in I$ do
2       $\tilde{c}_i \leftarrow c_i + \mathrm{Lap}(\Delta/\epsilon)$
3       if $\tilde{c}_i < 0$ then
4             $\tilde{c}_i \leftarrow 0$
Algorithm 2 DiffPriv (DP): Laplace Perturbation
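A minimal sketch of Algorithm 2 (again ours, not the authors' code): i.i.d. Laplace noise at scale $\Delta/\epsilon$ is added to every count and negatives are truncated. The sensitivity expression follows the bound derived in the proof of Lemma 1 as reconstructed above.

```python
import numpy as np

def diff_priv(counts, epsilon, B, d, rng=None):
    """counts: flattened Euler-histogram count vector (NumPy array);
    B: bound on body diameter; d: cell side length; epsilon: privacy level."""
    rng = np.random.default_rng() if rng is None else rng
    k = int(np.ceil(B / d)) + 1              # cells a body can span along one axis
    sensitivity = (2 * k + 1) ** 2           # L1 global sensitivity bound (Lemma 1)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)
    return np.maximum(counts + noise, 0.0)   # truncating negatives costs no privacy
```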
Corollary 3.

DiffPriv preserves $\epsilon$-differential privacy.

Proof:

The result follows by applying the triangle inequality to the odds ratio using the definition of Laplace density, and global sensitivity [11]. ∎

Utility. DiffPriv is neither covert nor consistent, however we can bound its utility.

Theorem 1.

For confidence level $\delta \in (0,1)$, the counts $\boldsymbol{c}$ output by Euler and the counts $\tilde{\boldsymbol{c}}$ output by DiffPriv are uniformly close with high probability:
$$\Pr\!\left(\max_{i \in I} \lvert \tilde{c}_i - c_i \rvert \;\leq\; \frac{\Delta}{\epsilon}\log\frac{\lvert I \rvert}{\delta}\right) \;\geq\; 1 - \delta .$$

Proof:

For convenience, we define the combined index set $I = I_F \cup I_E \cup I_V$, noting that $\lvert I \rvert = (2m+1)^2$ for an $m \times m$ grid. Recall that by the definition of DiffPriv, each $\tilde{c}_i - c_i$ is (before truncation) an independent zero-mean Laplace variable with scale $\Delta/\epsilon$; truncation at zero can only move $\tilde{c}_i$ closer to the non-negative $c_i$.

By the cumulative distribution function of the zero-mean Laplace, it follows that
$$\Pr\!\left(\lvert \tilde{c}_i - c_i \rvert > t\right) \;=\; \exp\!\left(-\tfrac{t\epsilon}{\Delta}\right)$$
for any scalar $t > 0$. By the union bound it follows that
$$\Pr\!\left(\exists i \in I : \lvert \tilde{c}_i - c_i \rvert > t\right) \;\leq\; \lvert I \rvert \exp\!\left(-\tfrac{t\epsilon}{\Delta}\right) .$$
Applying De Morgan’s law,
$$\Pr\!\left(\forall i \in I : \lvert \tilde{c}_i - c_i \rvert \leq t\right) \;\geq\; 1 - \lvert I \rvert \exp\!\left(-\tfrac{t\epsilon}{\Delta}\right) .$$
Solving $\delta = \lvert I \rvert \exp(-t\epsilon/\Delta)$ yields $t = \frac{\Delta}{\epsilon}\log\frac{\lvert I \rvert}{\delta}$, so that the claimed bound holds with probability at least $1 - \delta$. The result follows from the noise being i.i.d. ∎

Computational Complexity. On our $m$ row/column partition, DiffPriv’s time and space complexities are an efficient $O(m^2)$.

V-C Algorithm: Linear Programming

After adding randomised noise with DiffPriv, we apply constrained inference to smooth this noise, as detailed below. We first begin by defining constrained inference, followed by a set of consistency constraints.

V-C1 Constrained Inference: LAD

Constrained inference models the noisy counts output by DiffPriv as noisy observations of latent counts which are themselves related according to a set of constraints. Inference effectively smooths the differentially-private release, potentially improving utility without affecting privacy. Previously, ordinary least squares (OLS) has driven constrained inference [14, 25]. Here we propose instead to use least absolute deviations (LAD), also referred to as least absolute residuals, least absolute errors and least absolute value [26]. In contrast to OLS, LAD has the benefit of being robust to outliers. LAD is ideal for our setting, since its choice of minimising $L_1$ error corresponds to maximising the exponential of the negative $L_1$ norm: a Laplace noise model, akin to maximum-likelihood estimation, matching DiffPriv precisely.

Definition 4.

Let $\boldsymbol{c}$ be the Euler histogram counts with a set of defined constraints, $\mathcal{C}$. Given noisy histogram counts $\tilde{\boldsymbol{c}}$, constrained LAD inference returns a vector $\bar{\boldsymbol{c}}$ that satisfies the constraints while minimising $\lVert \bar{\boldsymbol{c}} - \tilde{\boldsymbol{c}} \rVert_1$.

Consistency. We define three constraints C1, C2 and C3 for Euler histograms as follows. Our consistency constraints consider the relationships between face, edge and vertex counts. Every increment to an edge count must correspond to an increment to the counts of both incident faces as well; and similarly, for an increment to a vertex count, the corresponding four incident edge counts must be incremented. Finally, query regions should respond with non-negative count estimates. These represent the intuition behind our three sets of consistency constraints. For ease of exposition, we refer to the face, edge and vertex components of a count vector by $f$, $e$ and $v$ respectively. The meaning will be apparent from context.

Constraint 1.

Every edge count is less than or equal to the minimum value of its two incident faces.

Constraint 2.

Every vertex count is less than or equal to its four incident edges’ counts.

Constraint 3.

Every $i$ by $j$ sub-grid of the partition should have a non-negative count as computed by Euler’s formula (1).

Algorithm. We consider two constrained inference programs for enforcing these constraints. Both minimise the change to the histogram counts subject to the constraints. The first, LAD, minimises the change to the counts with respect to the $L_1$-norm: $\min_{\bar{\boldsymbol{c}} \geq 0} \lVert \bar{\boldsymbol{c}} - \tilde{\boldsymbol{c}} \rVert_1$ subject to C1–C3.

By introducing a primal variable $t_i$ per histogram cell count, we can transform this to the following linear program
$$\min_{\bar{\boldsymbol{c}} \geq 0,\ \boldsymbol{t}} \; \sum_{i \in I} t_i \qquad (2)$$
$$\text{s.t.} \quad -t_i \;\leq\; \bar{c}_i - \tilde{c}_i \;\leq\; t_i \quad \forall i \in I, \qquad \text{C1--C3 on } \bar{\boldsymbol{c}} .$$
Alternatively we could adopt the $L_\infty$-norm for minimising the change to the histogram cell counts, as in the program $\min_{\bar{\boldsymbol{c}} \geq 0} \lVert \bar{\boldsymbol{c}} - \tilde{\boldsymbol{c}} \rVert_\infty$ subject to C1–C3.

And again we may transform this program to an equivalent LP, this time introducing only a single new primal variable $t$
$$\min_{\bar{\boldsymbol{c}} \geq 0,\ t} \; t \qquad (3)$$
$$\text{s.t.} \quad -t \;\leq\; \bar{c}_i - \tilde{c}_i \;\leq\; t \quad \forall i \in I, \qquad \text{C1--C3 on } \bar{\boldsymbol{c}} .$$

We analyse Program (3); however, we recommend that in practice Program (2) be used, since it is better able to minimise the change to all cell counts, while Program (3) only minimises the maximum change. Algorithm 3 and our experiments reflect this recommendation.

Input : Noisy histogram $(\boldsymbol{a}, \tilde{\boldsymbol{c}})$
Output : Consistent histogram $(\boldsymbol{a}, \bar{\boldsymbol{c}})$
Solve Program (2).
Algorithm 3 LinProg (LP): Linear Programming
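The following sketch (our own, under simplifying assumptions) shows how Program (2) can be posed with an off-the-shelf LP solver on a toy fragment of a histogram: two faces and their shared interior edge. Only constraint C1 and the |z_i − c̃_i| ≤ t_i encoding appear; boundary components and constraint C3 are omitted for brevity, and the numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

c_noisy = np.array([4.7, 2.1, 3.6])   # noisy counts: [face0, face1, edge01]
n = len(c_noisy)

# Variables x = [z_0..z_{n-1}, t_0..t_{n-1}]: z are repaired counts,
# t are auxiliary variables enforcing |z_i - c_noisy[i]| <= t_i.
cost = np.concatenate([np.zeros(n), np.ones(n)])    # minimise sum_i t_i

A_ub, b_ub = [], []
for i in range(n):
    row = np.zeros(2 * n); row[i] = 1;  row[n + i] = -1
    A_ub.append(row); b_ub.append(c_noisy[i])        #  z_i - t_i <=  c~_i
    row = np.zeros(2 * n); row[i] = -1; row[n + i] = -1
    A_ub.append(row); b_ub.append(-c_noisy[i])       # -z_i - t_i <= -c~_i

# Constraint C1: the edge count (index 2) is at most each incident face count.
for face in (0, 1):
    row = np.zeros(2 * n); row[2] = 1; row[face] = -1
    A_ub.append(row); b_ub.append(0.0)               # z_edge - z_face <= 0

res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, None)] * (2 * n), method="highs")
print("repaired counts:", np.round(res.x[:n], 3))
```

Here the noisy edge count (3.6) exceeds one incident face (2.1), violating C1; the solver repairs the violation with minimal total $L_1$ change.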

Privacy. Since LinProg depends only on the output of DiffPriv, it preserves the same level of differential privacy.

Utility. We can establish high-probability utility bounds on LinProg that take a similar form to those proved for DiffPriv, but via different arguments.

Theorem 2.

For any confidence level $\delta \in (0,1)$, and for histogram counts $\tilde{\boldsymbol{c}}$ output by DiffPriv and $\bar{\boldsymbol{c}}$ minimising Program (3), we have
$$\Pr\!\left(\max_{i \in I} \lvert \bar{c}_i - \tilde{c}_i \rvert \;\leq\; \frac{\Delta}{\epsilon}\log\frac{\lvert I \rvert}{\delta}\right) \;\geq\; 1 - \delta .$$

Proof:

We reduce to the bound on DiffPriv by noting that the true counts $\boldsymbol{c}$ are feasible for Program (3); since LinProg minimises $L_\infty$ distance, the distance from $\bar{\boldsymbol{c}}$ to $\tilde{\boldsymbol{c}}$ must be no more than that from $\boldsymbol{c}$ to $\tilde{\boldsymbol{c}}$. In other words
$$\lVert \bar{\boldsymbol{c}} - \tilde{\boldsymbol{c}} \rVert_\infty \;\leq\; \lVert \boldsymbol{c} - \tilde{\boldsymbol{c}} \rVert_\infty \;\leq\; \frac{\Delta}{\epsilon}\log\frac{\lvert I \rvert}{\delta} ,$$
with the final bound holding w.p. at least $1 - \delta$ by Theorem 1. ∎

Computational Complexity. Linear programming interior-point methods (also referred to as barrier algorithms) are polynomial-time, with worst-case complexity $O(n^{3.5} L)$ [27], for $n$ the number of variables and $L$ the input bit length. Therefore, for Euler histograms with $n = O(m^2)$ variables the time complexity is $O(m^7 L)$, but in practice it is efficient as demonstrated in our runtime experiments (cf. Sec. VI-H).

V-D Algorithm: Rounding

After running LinProg, we introduce covertness via Round. This allows the curator to hide that the data has been perturbed.

Input : Consistent histogram $(\boldsymbol{a}, \bar{\boldsymbol{c}})$
Output : Rounded histogram $(\boldsymbol{a}, \hat{\boldsymbol{c}})$
1 for $i \in I$ do
2       $\hat{c}_i \leftarrow$ nearest integer to $\bar{c}_i$
Algorithm 4 Rounding (R)

Privacy. Since Round depends only on differentially-private data, it also preserves differential privacy.

Utility. The analysis of utility for Round is more straightforward than for DiffPriv and LinProg.

Lemma 2.

If $\bar{\boldsymbol{c}}$ is the output histogram of LinProg and $\hat{\boldsymbol{c}}$ is the result of Round, then $\lVert \hat{\boldsymbol{c}} - \bar{\boldsymbol{c}} \rVert_\infty \leq 1/2$.

Lemma 3.

Round is consistent when run after LinProg, and so it is also covert.

Proof:

We only need to check the consistency constraints, as to whether Round violates any. This cannot happen, since the smaller side of a constraint inequality rounding up must coincide with the larger side rounding up. Similarly the larger side rounding down must coincide with the smaller side doing the same. Therefore, consistency is invariant to rounding. ∎

Computational Complexity. Similar to DiffPriv, Round’s time and space complexities are an efficient $O(m^2)$.
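A tiny illustration (ours, with illustrative values) of Algorithm 4 and the monotonicity argument behind Lemma 3: nearest-integer rounding preserves the "≤" relations that the consistency constraints impose.

```python
import numpy as np

consistent = np.array([2.4, 2.4, 1.7])   # e.g. [face0, face1, edge01]: edge <= both faces
rounded = np.rint(consistent)             # nearest-integer rounding (monotone)
assert rounded[2] <= rounded[0] and rounded[2] <= rounded[1]   # C1 survives rounding
print(rounded)                            # [2. 2. 2.]
```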

Fig. 8: Pre-processing in experimental setup (panels (a)–(d)): computing the KDE and mode for a set of GPS points, then the convex hull. Based on a sample of one cab’s GPS points in San Francisco, from Cabspotting.

V-E Full Theoretical Analysis

We are now able to combine the individual utility analyses of the four stages of our approach, into an overall high-probability bound on utility.

Corollary 4.

For confidence level $\delta \in (0,1)$, and histogram counts $\boldsymbol{c}$, $\hat{\boldsymbol{c}}$ output by Euler and Round respectively, we have that
$$\max_{i \in I} \lvert \hat{c}_i - c_i \rvert \;\leq\; \frac{1}{2} + \frac{2\Delta}{\epsilon}\log\frac{(2A/d+1)^2}{\delta}$$
holds with probability at least $1 - \delta$.

Proof:

By Theorems 1–2, Lemma 2 and the triangle inequality,
$$\lVert \hat{\boldsymbol{c}} - \boldsymbol{c} \rVert_\infty \;\leq\; \lVert \hat{\boldsymbol{c}} - \bar{\boldsymbol{c}} \rVert_\infty + \lVert \bar{\boldsymbol{c}} - \tilde{\boldsymbol{c}} \rVert_\infty + \lVert \tilde{\boldsymbol{c}} - \boldsymbol{c} \rVert_\infty \;\leq\; \frac{1}{2} + \frac{2\Delta}{\epsilon}\log\frac{\lvert I \rvert}{\delta}$$
w.h.p., where $\Delta \leq (2\lceil B/d \rceil + 3)^2$ by Lemma 1. Continuing, we substitute the component counts of the data structure.

We have used the following counts, where $m = A/d$ is the number of rows/columns in the grid-partitioned area of side length $A$:
$$\lvert I_F \rvert = m^2, \quad \lvert I_E \rvert = 2m(m+1), \quad \lvert I_V \rvert = (m+1)^2, \quad \lvert I \rvert = (2m+1)^2 .$$
This completes the proof. ∎

Note, the utility bound’s error is $O\!\left(\frac{(B/d)^2}{\epsilon}\log\frac{A}{d\,\delta}\right)$ w.h.p.

Remark 1.

In order to achieve appropriate utility, we recommend selecting cell size $d$ based on third-party requirements. The smallest QR that a third party might run on an area is a reasonable choice for $d$. $B$ can naturally be set by users or the service provider. There is little risk that $B$ would be made too large, as a user cannot have a very large region representing their regular location within a short time interval. In, e.g., fitness applications, users can determine the area in which they usually do their workouts.

VI Experimental Study

VI-A Datasets

We conduct extensive experiments on three real-world datasets that vary in terms of density and concentration of locations. One dataset records GPS coordinates of more than 500 taxis over 30 days in the San Francisco Bay Area; cab mobility traces are provided through the Cabspotting project [28]. Here, cabs’ GPS points are more concentrated on the financial district and surrounding areas, and we select this area for the empirical study (cf. Fig. 8(c)). Our remaining datasets are from Beijing (Microsoft Research Asia): the GeoLife project Version 1.3 [29] and T-Drive [30]. In GeoLife 1.3, GPS trajectories were collected by 182 users and comprise 18,000 trajectories; 91.5 percent of the trajectories are logged in a dense representation (every 1–5 seconds or every 5–10 meters per point). The GeoLife dataset gathers a broad range of users’ outdoor movements, including not only everyday routines, e.g., going home and commuting to work, but also entertainment and sporting activities such as shopping, sightseeing, dining, hiking, and cycling. T-Drive includes the GPS trajectories of about 10,000 taxis within Beijing, totalling about 15 million points. Compared to GeoLife, T-Drive has a more even distribution of users’ spatial regions across a partitioned space.

VI-B Pre-processing

We pre-process each dataset to extract convex planar bodies, representing regions that users most frequently visit. This simulates a real application where extraction might be conducted at the end point, e.g., in a fitness tracker where users can set their workout area. The steps are listed below, followed by a code sketch of the pipeline.

  • Fit a kernel density estimate (KDE) and consequently take the mode of each user’s set of GPS points;

  • Take the k-nearest-neighbour (k-NN) points to the mode, with $k$ chosen to correspond to a fixed time span of points (e.g., roughly 8 hours for GeoLife); if the number of GPS points is less than $k$ we take all points;

  • Check that all the points are within the defined diameter $B$, otherwise discarding outliers; and

  • Compute the convex hull of remaining points to create a convex planar body representing an area of frequent visitation.
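Below is a minimal sketch of this pipeline under our own assumptions; it is not the authors’ implementation. The parameter values (k, max_diam) are illustrative, and outliers are filtered by distance from the mode rather than by pairwise diameter.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial import ConvexHull

def extract_body(points, k=200, max_diam=2.0):
    """points: (n, 2) array of one user's GPS coordinates (projected, km)."""
    kde = gaussian_kde(points.T)
    mode = points[np.argmax(kde(points.T))]               # densest observed point as the mode
    dist = np.linalg.norm(points - mode, axis=1)
    knn = points[np.argsort(dist)[:min(k, len(points))]]  # k nearest points to the mode
    knn = knn[np.linalg.norm(knn - mode, axis=1) <= max_diam]   # crude outlier filter
    hull = ConvexHull(knn)                                 # convex planar body
    return knn[hull.vertices]                              # hull vertices, counter-clockwise

rng = np.random.default_rng(0)
pts = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(500, 2))
print(extract_body(pts).shape)
```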

Fig. 8(a) demonstrates the trajectory of a cab in San Francisco, taken from the Cabspotting project. In Fig. 8(b), the level sets within the contour lines are convex, and we could have picked these for our convex planar body; but in general level sets are not convex, and our approach generates a convex approximation. As depicted in Fig. 8(c), cab GPS points in this dataset are dense and concentrated in a specific area. Fig. 8(d) illustrates the extracted convex body.

After pre-processing, we create histogram counts for each convex body to construct the Euler histogram, which serves as our baseline approach and as the basis for our other algorithms.

VI-C Parameter Settings

Initial settings for Beijing for the four parameters $A$ (area side length), $d$ (cell size), $B$ (bounded diameter) and $\epsilon$ are 20km, 1km, 2km and 1 respectively. These settings are applied to the T-Drive and GeoLife1.3 datasets. With regard to San Francisco’s Cabspotting dataset, the area size is 3.2km × 3.2km and the cell size is 0.8km, but the remaining parameters are the same (cf. Table I).

Even though the literature on point data [14, 15] tends to use only specific QR sizes, we vary the QR parameter over the entire range of the area size to more fully evaluate our technique. For experiments where we compare histograms, the $A/d$ ratio, which defines the number of grid cells along each axis, has been kept constant for all datasets (cf. Sec. VI-H).

VI-D Evaluation Metrics

Apart from the varying parameter, we keep all other parameters fixed and compute the median relative error as an empirical measure of utility, as is standard [14, 15]. We repeat each of the experiments 100 times and compute the median relative error. The baseline approach is Euler, as it provides exact answers. The privacy-preserving algorithms DiffPriv, LinProg and Round are compared to Euler. Furthermore, we compute the running time for each algorithm (cf. Sec. VI-H).
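As a small illustration (the floor of 1 on the denominator for zero-count queries is our own convention), the metric can be computed per query as follows:

```python
import numpy as np

def median_relative_error(exact_count, private_responses):
    """Median over repetitions of |response - exact| / exact,
    with a floor of 1 on the denominator (assumption for exact counts of zero)."""
    responses = np.asarray(private_responses, dtype=float)
    return np.median(np.abs(responses - exact_count) / max(exact_count, 1))
```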


Dataset       Cell Size (d)   B      Area Size (A)   A/d         QR Size/Shape   ε
T-Drive       1km             2km    20km*20km       20          1-10%           1
T-Drive       1km             2km    20km*20km       20          10-100%         1
T-Drive       0.66,1,2km      2km    20km*20km       30,20,10    1%              1
T-Drive       2km             2km    20km*20km       10          1%              0.1,0.4,0.7,1
GeoLife1.3    1km             2km    20km*20km       20          1-10%           1
GeoLife1.3    1km             2km    20km*20km       20          10-100%         1
Cabspotting   0.8km           2km    3.2km*3.2km     4           10-100%         1
TABLE I: Experimental settings. The table shows the parameter ranges; in each row the varying parameters are those listed with multiple values or a range (bolded in the original).
Fig. 13: Median relative error per query size and shape for the T-Drive dataset; panels: (a) QR size 1–10% of total area, (b) QR size 10–100%, (c) various QR shapes (smaller), (d) various QR shapes (larger).
Fig. 18: Median relative error per query size and shape for the GeoLife1.3 dataset; panels as in Fig. 13.
Fig. 21: Median relative error per query size and shape for the Cabspotting dataset; panels: (a) various QR shapes, (b) QR size 10–100% of total area.
Fig. 22: Varying area size/cell size ratio for the T-Drive dataset.
Fig. 23: Varying privacy parameter ε for the T-Drive dataset.

VI-E Varying Query Rectangle Size

In this section we compute the median relative error on all datasets, representing diversity in terms of sparsity, density and concentration, to demonstrate the effect on accuracy. We fix every parameter except QR size, and run range queries of various sizes and positions on the partitioned map, based on the definition of a QR as a union of grid cells. Range queries are varied from 1 to 10 and 10 to 100 percent of the total area of the respective city. The results for various sizes as well as shapes of range queries are shown in Figs. 13–21. Various factors can affect the response to a QR, including its shape and size, whether convex bodies are sparse or dense in the space, and whether or not they are concentrated. Furthermore, the computed global sensitivity (see Lemma 1) differs across dataset settings, e.g., between the T-Drive/GeoLife settings and Cabspotting, and this value also affects the results. The similarity between T-Drive and Cabspotting is that both record taxi driver movements; a difference is that the former is not concentrated on a specific area while the latter is. In GeoLife1.3 the convex bodies are denser, the dataset having a large number of trajectories.

As depicted in Fig. 13 for the T-Drive dataset, since the data is more evenly distributed the error is very low for larger QR sizes (Fig. 13(b)), and remains small for smaller QRs (Fig. 13(a)). A variety of QR shapes for the smaller sizes (Fig. 13(c)) and larger ones (Fig. 13(d)) are depicted accordingly. For instance, a 1% QR on the partitioned map of Beijing covers four cells and could take 1×4, 2×2, or 4×1 geometries, where the first number represents the number of rows and the second the number of columns. In GeoLife1.3 (Fig. 18), by comparison, since trajectories are more focused on certain areas, the error increases as QR size decreases (Fig. 18(a)). With regard to the Cabspotting dataset (Fig. 21), some parts of the selected area are sparser, which consequently affects the result of DiffPriv for the QR sizes of 50% and 60%, as they contain both dense and sparse cells. However for larger QRs errors cancel each other out due to the Euler formula (1). In all cases, LinProg and Round reduce the errors and provide a high level of accuracy. Since the number of spatial partitions for the chosen area is smaller than for the other datasets, only QR sizes and shapes between 10%–100% are shown in Figs. 21(a) and 21(b); the QR errors for the smaller sizes 1%–9% are less than 10%.

LinProg and Round provide similar results, and as discussed in Sec. V, the difference is the covertness property of Round.

Providing consistency, through the LinProg and Round techniques, can improve accuracy (cf. Secs. VI-F and VI-G). For the remaining experiments, varying other parameters, we focus on the T-Drive dataset and the 1% QR size as a conservative representative, since it has higher error.

VI-F Varying Area Size/Grid Cell Size Ratio

We vary the area size (A) over grid cell size (d) ratio and compute the median relative error for a QR taken as 1% of the total area of the T-Drive dataset. The area size for this dataset is 20km × 20km. By increasing the cell size, we expect that the accuracy improves, as demonstrated in Fig. 22. We have fixed the QR at 1%, and varied the grid cell size over 0.66km, 1km, and 2km to yield ratios of 30, 20, and 10 respectively. As shown, increasing the grid cell size increases accuracy. As illustrated in Fig. 22, as we decrease the grid cell size, the error increases due to the higher global sensitivity for smaller cell sizes: the GS of Lemma 1 grows as the cell size shrinks from 2km to 1km to 0.66km. If we wish to decrease $d$ without incurring reduced accuracy, our theoretical results suggest that we should also decrease $B$ and $A$.

VI-G Varying Privacy Parameter

We apply a similar procedure to vary the privacy parameter $\epsilon$ across values 0.1, 0.4, 0.7, and 1 with a fixed QR of 1% of the total area (20km × 20km) and cell size 2km. The effect of increasing $\epsilon$ on accuracy is depicted in Fig. 23. Decreasing the epsilon value from 1 to 0.1 increases the scale parameter $\Delta/\epsilon$ of the Laplace distribution (the noise added to the counts) by a factor of ten, and this affects the accuracy of the result. To keep accuracy relatively constant when reducing $\epsilon$, the third party can vary other parameters.

Fig. 24: Running time per algorithm for all datasets.

VI-H Running Time

Fig. 24 shows running times for all datasets of various sizes. As discussed in Sec. VI-C, we kept the $A/d$ ratio fixed. The running times for all the datasets are approximately similar for each technique. The y-axis is in seconds (log scale); for the largest dataset, GeoLife1.3, the total running time is 196 seconds. DiffPriv, LinProg and Round each take less than 1 second for all the datasets. Each of our algorithms is eminently practical to implement and to run.

VII Concluding Remarks

For the first time we propose a non-interactive differentially-private approach to counting planar bodies representative of users’ spatial regions e.g., a workout area, areas of customer preference for hotel bookings, or locations of frequent visitation for facility planning.

The key insight of our approach is to leverage Euler histograms for accurate counting, cell perturbations for differential privacy, and constrained inference smoothing to reinstate consistency. Constrained inference often improves utility by cancelling noisy perturbations. Our formulation of constrained inference is a novel constrained application of the robust method of least absolute deviations. Unlike existing constrained inference based on ordinary least squares, our formulation precisely matches our privacy-preserving cell perturbation distribution. By optimising for consistency while rounding cell counts, we achieve a covertness property for our counting mechanism: third parties cannot determine that we have perturbed data in the first place.

A full theoretical analysis of utility and differential privacy is complemented by experimental results on three datasets.

Acknowledgement

This work was supported in part by Australian Research Council DECRA grant DE160100584.

References

  • [1] G. Ghinita, Privacy for Location-based Services, ser. Synthesis Lectures on Information Security, Privacy, and Trust.   Morgan & Claypool, 2013.
  • [2] J. Krumm, “Inference attacks on location tracks,” in Pervasive Computing, PERVASIVE’07, 2007, pp. 127–143.
  • [3] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao, “Efficient OLAP operations in spatial data warehouses,” in SSTD’01, 2001, pp. 443–459.
  • [4] Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias, “Spatio-temporal aggregation using sketches,” in ICDE’04, 2004, pp. 214–225.
  • [5] I. F. V. López, R. T. Snodgrass, and B. Moon, “Spatiotemporal aggregate computation: a survey,” IEEE Trans. KDE, vol. 17, no. 2, pp. 271–286, 2005.
  • [6] F. Braz, S. Orlando, R. Orsini, A. Raffaetà, A. Roncato, and C. Silvestri, “Approximate aggregations in trajectory data warehouses,” in ICDE’07, 2007, pp. 536–545.
  • [7] L. Leonardi, S. Orlando, A. Raffaetà, A. Roncato, C. Silvestri, G. L. Andrienko, and N. V. Andrienko, “A general framework for trajectory data warehousing and visual OLAP,” GeoInfo., vol. 18, no. 2, pp. 273–312, 2014.
  • [8] R. Trudeau, Introduction to Graph Theory.   Dover, 1993.
  • [9] R. Beigel and E. Tanin, “The geometry of browsing,” in LATIN ’98, 1998, pp. 331–340.
  • [10] C. Sun, D. Agrawal, and A. El Abbadi, “Selectivity estimation for spatial joins with geometric selections,” in EDBT’02, 2002, pp. 609–626.
  • [11] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in TCC’06, 2006, pp. 265–284.
  • [12] A. Inan, M. Kantarcioglu, G. Ghinita, and E. Bertino, “Private record matching using differential privacy,” in EDBT’10, 2010, pp. 123–134.
  • [13] R. Chen, B. C. M. Fung, B. C. Desai, and N. M. Sossou, “Differentially private transit data publication: a case study on the Montreal transportation system,” in KDD’12, 2012, pp. 213–221.
  • [14] G. Cormode, C. M. Procopiuc, D. Srivastava, E. Shen, and T. Yu, “Differentially private spatial decompositions,” in ICDE’12, 2012, pp. 20–31.
  • [15] W. H. Qardaji, W. Yang, and N. Li, “Differentially private grids for geospatial data,” in ICDE’13, 2013, pp. 757–768.
  • [16] X. He, G. Cormode, A. Machanavajjhala, C. M. Procopiuc, and D. Srivastava, “DPT: differentially private trajectory synthesis using hierarchical reference systems,” PVLDB, vol. 8, no. 11, pp. 1154–1165, 2015.
  • [17] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar, “Privacy, accuracy, and consistency too: a holistic solution to contingency table release,” in PODS’07, 2007, pp. 273–282.
  • [18] H. Xie, E. Tanin, and L. Kulik, “Distributed histograms for processing aggregate data from moving objects,” in MDM’07, 2007, pp. 152–157.
  • [19] M. Fanaeepour, L. Kulik, E. Tanin, and B. I. P. Rubinstein, “The CASE histogram: privacy-aware processing of trajectory data using aggregates,” GeoInformatica, pp. 1–52, 2015.
  • [20] C. Dwork, “Differential privacy: A survey of results,” in TAMC’08, 2008, pp. 1–19.
  • [21] H. To, G. Ghinita, and C. Shahabi, “A framework for protecting worker location privacy in spatial crowdsourcing,” PVLDB, vol. 7, no. 10, pp. 919–930, 2014.
  • [22] B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft, “Learning in a large function space: Privacy-preserving mechanisms for SVM learning,” J. Privacy and Confidentiality, vol. 4, no. 1, pp. 65–100, 2012.
  • [23] D. J. Mir, S. Isaacman, R. Cáceres, M. Martonosi, and R. N. Wright, “DP-WHERE: differentially private modeling of human mobility,” in BigData’13, 2013, pp. 580–588.
  • [24] C. Li, M. Hay, G. Miklau, and Y. Wang, “A data- and workload-aware query answering algorithm for range queries under differential privacy,” PVLDB, vol. 7, no. 5, pp. 341–352, 2014.
  • [25] M. Hay, V. Rastogi, G. Miklau, and D. Suciu, “Boosting the accuracy of differentially private histograms through consistency,” PVLDB, vol. 3, no. 1, pp. 1021–1032, 2010.
  • [26] T. E. Dielman, “Least absolute value regression: recent contributions,” J. Stat. Computation and Simulation, vol. 75, no. 4, pp. 263–286, 2005.
  • [27] N. Karmarkar, “A new polynomial-time algorithm for linear programming,” in STOC’84, 1984, pp. 302–311.
  • [28] M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser, “A Parsimonious Model of Mobile Partitioned Networks with Clustering,” in COMSNETS, 2009. [Online]. Available: http://www.comsnets.org
  • [29] Y. Zheng, X. Xie, and W. Ma, “Geolife: A collaborative social networking service among user, location and trajectory,” IEEE Data Eng. Bull., vol. 33, no. 2, pp. 32–39, 2010.
  • [30] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang, “T-drive: driving directions based on taxi trajectories,” in ACM-GIS’10, 2010, pp. 99–108.