Artificial Impostors for Location Privacy Preservation


Cheng Wang and Zhiyang Xie

Cheng Wang and Zhiyang Xie are with the Department of Computer Science and Engineering, Tongji University, and with the Key Laboratory of Embedded System and Service Computing, Ministry of Education, China. (E-mail: cwang@tongji.edu.cn, 121xzy@tongji.edu.cn)
Abstract

The progress of location-based services has led to serious concerns about location privacy leakage. Existing methods for location privacy preservation (LPP) are still not both effective and efficient: they are often vulnerable to identification attacks that exploit side information, or hard to implement because of their high computational complexity. In this paper, we pursue high protection efficacy and low computational complexity simultaneously. We propose a scalable LPP method based on the paradigm of counterfeiting locations. To make fake locations highly plausible, we forge them by synthesizing artificial impostors (AIs). AIs are synthesized traces that have semantic features similar to those of the actual traces and do not contain any target location. Two dedicated techniques are devised: a sampling-based synthesis method and a population-level semantic model. They play significant roles in the two critical steps of synthesizing AIs. We conduct experiments on real datasets from two cities (Shanghai, China and Asturias, Spain) to validate the high efficacy and scalability of the proposed method. In these two datasets, the experimental results show that our method achieves the preservation efficacy of and , and its run time of building the generators is only and seconds, respectively. This study may give the research community new insights into improving the practicality of the state-of-the-art LPP paradigm of counterfeiting locations.

Keywords

Location Privacy Preservation, Artificial Impostor Trace, Process-free, Population-level

1 Introduction

With the proliferation of smart mobile terminals, e.g., smart phones and vehicle-mounted communication devices, and of the location-based services (LBS) running on these mobile devices, the privacy of locations has attracted much attention [1]. Location privacy usually refers to the demand to prevent other parties from learning one’s current or past locations [2]. The obfuscation mechanism is widely used to protect location privacy: a trusted proxy transforms an actual location into a set of locations. The LBS provider then answers the queries for all of the locations in this set, and the trusted proxy returns the useful information to the user.

Some obfuscation methods generate the location set by blending the actual location with the locations of other clients [2, 3, 4]. However, these methods leak the privacy of other real clients. Some methods replace the user’s location with nearby points of interest [5, 6]. To some extent, the substitutes might reveal the region where the user is. Other methods generate fake locations based on different moving patterns, and mix the actual location with the fake ones in a set [7, 8, 9]. Unfortunately, these protection methods are vulnerable to inference attacks, or suffer from huge time costs. The server or an adversary can easily identify the fake locations with some side information. For instance, suppose a fake location falls in a supermarket at 2:00 am. Common sense says that the supermarket is closed at this time. This kind of attack uncovers the fake locations through the semantic relations between locations and time or activities. Such fake locations are too implausible to go unnoticed. In other words, the goal of this kind of obfuscation mechanism is to make the fake locations plausible on the server side.

In this paper, we aim to design a scalable and high-quality location privacy preserving method. The preservation quality (efficacy) of a model is defined as the attacker’s probability of error in inferring the real trace [10]. To achieve high efficacy of privacy protection, we generate fake locations by synthesizing artificial impostor traces. The so-called impostors are synthesized traces that have semantic features similar to those of the actual traces and do not contain target locations. Note that, because of location cloaking, a location refers to a region in our method. To make impostor traces plausible, we utilize the visiting patterns of locations and the mobility patterns of users. A visiting pattern is the temporal distribution of visitors to a region. A mobility pattern consists of the transition probabilities among regions and the run time users need to pass through regions. There are two steps in our method: the first extracts regional semantic features based on the visiting pattern (offline generator), and the second generates impostor traces based on the mobility pattern (online generator). Accordingly, we devise two critical techniques for these two steps to achieve high efficacy with low computational complexity:

(1) A Sampling-Based Synthesis Method. Users’ trajectories are composed of their visited locations. A straightforward method for synthesizing an impostor trace is to forge a fake location for every visited location. However, this complicated method does not necessarily improve the preservation efficacy, because requiring too many fake locations to be plausible simultaneously may limit the number of candidate plausible traces and thus reduce the preservation efficacy. When generating a plausible fake trace for a user, it is unnecessary to use all the locations in his/her trajectory. Our solution is to sample the visited locations directly. The challenge is how to keep the generated artificial impostor traces plausible when they are based only on the sampled locations. To this end, we choose only some of the pause points (not the locations merely travelled past) as the starting and ending points of every stage of a trip, and define them as stations. They are the semantic places where people want to go, rather than arbitrary sampled locations on the road. For example, Alice drives home after work and wants to buy a hamburger at a drive-through restaurant. There are three stations along her trace: her company, the restaurant, and her home. Apart from these three stations, the locations on the sections she drives between them are relatively meaningless in the privacy sense. Based on these stations, we synthesize an impostor trace by the following procedure: we replace them with fake ones (semantically similar locations), and then complete the corresponding plausible visiting paths by finding the -th most probable traces according to the transition probabilities among regions.

(2) A Population-Level Semantic Model. A station is a record consisting of a location and a time. To make a generated fake station resemble the real one, we let the fake and real stations have similar regional semantic features. In extracting the regional semantic features, a critical step is location clustering. Each cluster of locations has a latent semantic feature. To uncover the semantic similarity among locations in terms of visiting patterns, an effective approach is to analyze the transition probabilities among visited locations along the traces of different users. Unsurprisingly, such an individual-level approach is fine-grained enough to complete semantic clustering with high efficacy. However, its computational cost is daunting. In this work, under the precondition of high efficacy, we strive for a low computational cost by devising a population-level model to extract features and cluster locations. We divide the city map into different regions and break up all trajectories into independent locations. After that, we extract the stations of users’ traces, and model the geographical distribution of these stations in different regions. The feasibility of our population-level model depends on the fact that the basic semantic feature of a region can be judged from the mobility of crowds. Consider a simple situation where most people leave home for work in the morning and return home at night. In a residential area, the out-stream is large in the morning and the in-stream is large at night, whereas the opposite holds for the stream of people in a workplace. Hence, based on this geographical distribution model, we can compute the semantic similarity of different regions, and then aggregate all regions into clusters with different latent semantic features.

Together, these two techniques achieve high computational efficiency with high preservation efficacy. Intuitively, the sampling-based synthesis method can, to some extent, improve the preservation efficacy and thereby offset the reduction caused by the coarse clustering of the population-level model.

In the experiments, we compare our method with four representative ones on maps of different scales in two cities (Shanghai, China and Asturias, Spain). We validate the performance against the state-of-the-art location inference attack [10]. The main experimental results on these two datasets (Shanghai’s and Asturias’s, respectively) can be summarized as follows:

In the small scale map (), the preservation efficacy of our method is slightly lower than that of SG-Lppm [7], the state-of-the-art method in terms of preservation efficacy, but significantly higher than that of the other three methods. Moreover, our method has a significant advantage over SG-Lppm in the time consumed by generating a fake trace online. Specifically, our method achieves the efficacy of and , compared with and by SG-Lppm in two datasets, respectively, while all other methods achieve an efficacy of less than . To generate one fake trace online, it takes our method only ms and ms, compared with s and s by SG-Lppm in two datasets, respectively.

In the large scale map () of two cities, we build the impostor generator offline in just and seconds in two datasets, and we can achieve the efficacy of and , respectively. In contrast, SG-Lppm ran for two weeks without producing any output, due to its high time complexity.

The rest of this paper is organized as follows: In Section 2, we provide the related work. We give an overview on our proposed scheme in Section 3, and describe two main steps of the scheme in Section 4 and Section 5, respectively. We provide the validation of preservation efficacy and the evaluation of scalability of our methods in Section 6. Finally, we draw a conclusion in Section 7.

2 Related Work

There are already many notable studies on location privacy preservation [7, 11, 12, 13, 14, 15, 16]. Some techniques are independent of a trusted third party (TTP) [17, 18, 13, 19]. These TTP-free techniques achieve fine-grained privacy preservation through encryption. Other techniques require a TTP to achieve location privacy preservation. Resource-intensive computation can be carried out on the TTP instead of on mobile devices, which makes this approach popular. TTP-based location privacy preservation methods transform the target location into a set of locations, which preserves both the utility of services and the privacy of data. According to the different ways of generating location sets, we classify these methods into categories as follows:

(1) Anonymizing the actual location in a set of locations of other users. This anonymization mechanism is a function to hide the user in the class with users in [2, 20, 3, 4, 21, 22]. Hara et al. [22] selected the fake traces for a targeted real trace from other users’ traces under some strict constraints. Although this protection performs well, other users’ privacy is leaked in this process. Yao et al. [4] divided the area into clusters (namely ) where each cluster includes users. When a client issues a query to the LBS provider, the boundary of the area he belongs to is sent to the servers. This limits an adversary’s probability of inferring the accurate location to . However, if there are no clients near the user, the will be too large to provide satisfactory service. Moreover, this exposes the general location of these users. Lee et al. [3] extracted the semantic features of regions from the distribution of nearby users’ staying durations at locations, and computed the cloaking area from these semantic features. But the approach fails when there are not enough other users around.

(2) Replacing the actual location with a set of other places. This kind of mechanism replaces the real location with nearby places (e.g., spatial cloaking) [12, 23, 6, 5, 24]. Yiu et al. [6] transformed the user’s location to a nearby intersection or building. However, the distance between the real location and the target affects both utility and privacy. If there is no target near the user, the response from the LBS provider will not match the real query accurately, and if the target is close enough to the user, it will expose his/her location. The spatial transformation method [5] uses Hilbert curves to transform users’ locations, and sends the transformed location to the LBS server. The disadvantage of this method is that it requires LBS providers to transform all location data (such as the locations of shops), so the cost of maintaining the services is considerable.

(3) Generating a set with the real location and generated dummy locations [25, 26]. Generating dummy locations aims to hide the real location among a set of fake locations. The LBS provider responds to all the queries, and the TTP filters out the information that the user requires. So, this kind of method can keep the service utility. To improve the effectiveness, dummy locations can be generated by synthesizing traces. You et al. [9] generated dummy locations based on random walks. Krumm [8] built the path between two random locations in the map. To make fake locations more plausible, locations can be classified by semantic features. The method in [3] generates semantic features from the time durations spent at places, [27] by combining human mobility with POIs, and [28] by computing the frequency of visiting locations. On this basis, Shokri et al. [7] extracted the mobility patterns as common semantic features by matching every location of every pair of users in the real mobility datasets, and replaced all locations of a user’s real trace with ones in the same semantic cluster to generate fake traces. This method takes the correlation of sequential locations into account, and thus extends the generation of fake locations to fake traces. This approach achieves high efficacy in their paper. However, it incurs a huge time cost to generate the semantic clusters of regions, and requiring more fake locations to be plausible simultaneously limits the number of candidate plausible traces, which may reduce the applicability.

3 Overview of Proposed Scheme

Fig. 1: When a user issues a query about the record in the -th time interval to the LBS provider (e.g., the restaurants near the current location), her device first publishes this query and other information to a trusted third-party proxy. The proxy synthesizes impostor traces through the generator, and generates fake records by extracting the records at the -th time interval of the impostor traces. The TTP transfers the set of blended data to the LBS server. Because the records are indistinguishable, the server responds to all the queries in the set. Finally, the proxy filters the target response out of all returned values, and sends it with the generated fake records back to the user.

In this section, we present an overview of our scheme to generate dummy traces. Our approach runs on a trusted third-party proxy (TTP), which is well accepted in many location privacy preservation methods. The TTP is a centralized architecture, so large-scale computations can be carried out on it. Fig. 1 illustrates the system structure of our method, and the procedure is elaborated in the caption. The previously generated fake records and the real trace are stored on the user’s mobile device. The reason for storing generated fake records is explained in Section 5.1.

We discretize time and space by dividing the map into fixed-size grids and dividing a day into time intervals, obtaining benefits in two aspects:

(1) Keeping the stability of location privacy and utility of service when using LBS.

(2) Partitioning the continuous geographical and temporal data into discrete ones, which greatly simplifies the mathematical derivations in the following analysis.
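The discretization described above can be illustrated with a minimal sketch. It assumes a flat grid measured in degrees and equal-length time intervals; the function names, the grid origin, and the cell size `cell_deg` are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def to_grid(lat, lon, origin_lat, origin_lon, cell_deg):
    """Map a GPS coordinate to a (row, col) grid index.

    Assumption: a flat grid in degrees anchored at (origin_lat, origin_lon).
    cell_deg is a hypothetical cell size, not a value from the paper.
    """
    row = math.floor((lat - origin_lat) / cell_deg)
    col = math.floor((lon - origin_lon) / cell_deg)
    return row, col

def to_interval(seconds_since_midnight, num_intervals):
    """Map a time of day to one of num_intervals equal-length intervals."""
    interval_len = 86400 / num_intervals
    return int(seconds_since_midnight // interval_len) % num_intervals

def discretize(records, origin_lat, origin_lon, cell_deg, num_intervals):
    """Turn (lat, lon, seconds) records into (region, interval) pairs."""
    return [
        (to_grid(lat, lon, origin_lat, origin_lon, cell_deg),
         to_interval(sec, num_intervals))
        for lat, lon, sec in records
    ]
```

A coarser grid or fewer intervals gives stronger cloaking at the price of less precise service, which is the trade-off the first benefit refers to.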

As depicted in Fig. 2, there are four steps in our scheme. We first extract the semantic feature of each grid based on a given dataset of seed traces. Each seed consists of stations and sections, which we extract in the preprocessing phase corresponding to Step 1. Then, by analyzing the temporal distribution of the stations in each grid, we compute the similarity between every two grids, and extract the semantic features of these grids by clustering them in Step 2. In Step 3, we establish the mobility model for users, which includes the transition probabilities among grids and the run time of passing through grids. These processes are based on the mobility of the crowd, i.e., they are population-level, so the privacy of the seed traces is protected during the processes. Building the online generator module for impostor traces depends on the former module. Its procedure corresponds to Step 4 in Fig. 2. In this step, for a real trace, we extract its stations, and replace them with records that have similar regional features. Then we fill in the sections among these stations to generate impostor traces. We elaborate our scheme in detail in Section 4 and Section 5. Table I lists the notations that we use in this paper.

Fig. 2: Our scheme includes two modules: the offline generator module of synthetic rules (corresponding to Steps 1-3) and the online generator module of impostor traces (corresponding to Step 4).
Notation Definition
  a region (location)
  a time interval
  the set of regions
  the number of time intervals when generating semantic features
  the number of time intervals when building mobility model
  the number of time intervals when sampling locations of user’s trace
 (t) the number of people flowing into or out of r at t
  the distribution of people flowing into or out of r
  the semantic class to which station belongs
  the semantic similarity graph of regions
  the transition probability graph of regions
  tensor of run time of travelling across regions at
  the estimated time of reaching region
  the semantic distance between regions and
  the geographical distance between regions and
TABLE I: Table of notations

4 Offline Generator for Synthetic Rules

In this section, we propose the rules for synthesizing impostor traces. This is an offline procedure, and we build this module from seed traces. As illustrated in Fig. 2, we divide the module into three parts: extracting stations and sections (Step 1), generating semantic features of regions (Step 2), and modeling the mobility pattern of users (Step 3).

4.1 Extract Stations and Sections

This module is designed under the framework of spatial-temporal cloaking, so the map is divided into fixed-size grids. When users’ devices upload their trajectories to the trusted proxy, the data consist of a series of records of GPS coordinates and the corresponding times. We first standardize the records by matching GPS coordinates to grids, and matching exact times to time intervals. Each trajectory thus becomes a sequence of region-time pairs, denoted by .

A station is not just a simple pause location during the trip, but a phased destination of a user. The daily life of a person can be abstracted into a sequence of activities with locations and times. We define a station as the record of where and when the activity changes. For a vehicle driver, we focus on the process of driving, so his activities can be divided into two categories: driving and non-driving. Stations are thus the records of where and when a driver stops to do something unrelated to driving (e.g., shopping or picking up someone), or of where and when a driver starts the car. Specifically, for a taxi driver, a station is the record of where and when he stops to pick up passengers or lets passengers off. For a private car driver, the activity time is a parking duration that exceeds a threshold, and stations are the records at the start and the end of this duration. Intuitively, we usually care about the destination we want to reach, whereas the route we choose toward the destination depends on the traffic conditions.

We define a section as the part of a trace from a station to the next station. Obviously, a station is the end of the previous section and also the start of the next section. A trace consists of several consecutive sections. Fig. 3 depicts two examples. Intuitively, stations are the milestones of a trace.

Fig. 3: Alice drives back home (location ) from company (location ) in the evening, and she stops twice: The first one is the crossroads where she waits for the traffic light, and the second one is the drive-in restaurant where she gets a burger. The company, drive-in restaurant, home are stations, and the sections are the traces among them. Taxi driver Bob picks up passengers at location , and drops them at location . and are stations for him. The corresponding section is the section from to .
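For the private-car case, the extraction of stations and sections might be sketched as follows. This is only an illustration under stated assumptions: a trace is a time-sorted list of (region, time) records, a stay is a maximal run of records in one region, and `min_stay` is a hypothetical parking threshold. The function names are ours, not the paper's.

```python
def extract_stays(trace, min_stay):
    """Return (first_index, last_index) pairs of stays in a trace.

    trace: list of (region, time) records sorted by time.
    min_stay: hypothetical parking threshold, same unit as time.
    A maximal run of records in one region that lasts at least
    min_stay is treated as a stay; its first and last records are
    the stations (arrival and departure).
    """
    stays = []
    i = 0
    while i < len(trace):
        j = i
        # Extend the run while the region does not change.
        while j + 1 < len(trace) and trace[j + 1][0] == trace[i][0]:
            j += 1
        if trace[j][1] - trace[i][1] >= min_stay:
            stays.append((i, j))
        i = j + 1
    return stays

def split_sections(trace, stays):
    """Cut out the sections between one departure and the next arrival."""
    return [
        trace[end:nxt_start + 1]
        for (_, end), (nxt_start, _) in zip(stays, stays[1:])
    ]
```

With a threshold of a few minutes, a short wait at a traffic light does not become a station, while a stop at a restaurant does, which matches Alice's example in Fig. 3.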

4.2 Generating Semantic Features of Regions

Our goal here is to generate semantic features of regions by their visiting patterns. This process consists of two steps: The first is computing the semantic similarities between every two regions; the second is clustering regions based on similarities. To this end, we model the distribution of human flow for each region.

In the trajectory of a user, only stations are typical in the semantic sense, because they are phased destinations, whereas the sections between them are usually chosen as the paths with the minimum cost. For example, a man goes on a business trip to an unfamiliar city and intends to travel from his hotel to a company. A rational decision for him is usually to choose the path that takes the least time. During the trip, he does not care about the locations he merely passes. The hotel and the company are the only two locations with semantic features for him. So, we generate semantic features from stations only. Intuitively, if a start or end station of a person’s trace lies in a region, this region has semantic meaning to him, but if he just passes through the region, it means nothing to him. If we took all the records of the whole trace into account, the non-station records would cause inaccuracy. For each region grid, we therefore ignore the movement of people inside it and the movement of people just passing through it, and focus on the temporal distribution of stations in this region.

As shown in Step 2 of Fig. 2, we propose a semantic metric to compare the similarity among stations based on the temporal distributions of the crowd. In this step, we assign a high similarity value to a pair of regions if their distributions of flow-ins and flow-outs are similar, regardless of their geographical distance. For example, the flow-ins of regions and are mostly distributed in the morning, and the majority of their flow-outs are distributed in the evening. Obviously, their distributions of human flow are very similar. Although , we assume that and are semantically similar regions. In this example, and might be workplaces.

The semantic similarity metric compares how the human flow of different regions changes over time. Thus, we divide a day into time intervals, and denote the number of intervals by . In the previous process, we extracted stations from seed traces. We let and denote the numbers of human flow-ins and flow-outs of region in time interval , respectively. Depending on the source of the human flow, there are two ways to calculate and :

(1) The human flow is caused by the mobility of the mobile devices’ owners (e.g., private car drivers). For example, if a man goes to a bar in the evening, he visits this place, which increases of this region. In this case, we compute the human flow as follows: During the period from the time when a client reaches region to the time he leaves , if station in his trace is the first station in , we assume that a person enters and visits this region, so adds . Analogously, if station in his trace is the last station during this period, he leaves this region, and adds . For example, a driver drives into region from region at time , and stops his car to work at his company at time (station ). At time , he leaves his company and drives home (station ), and he reaches the next region at time . So, adds , and adds . A sketch of this case is given in the accompanying figure.

(2) The human flow has nothing to do with the mobility of the mobile devices’ owners (e.g., taxi drivers). For example, a taxi driver takes a passenger to a hospital. This driver does not contribute to the human flow of this region, but the passenger does. In this case, there must be a signal that represents the mobility of passengers. If the signal indicates that a passenger gets off in region in the time interval , adds . Conversely, if a passenger gets on, adds .
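The counting rule of case (1) can be sketched as below. The sketch assumes that each trace is a time-sorted list of (region, interval, is_station) records and that one visit to a region is a maximal run of consecutive records in that region; both the record layout and the function name are illustrative assumptions.

```python
from collections import defaultdict

def count_flows(traces, num_intervals):
    """Count flow-ins and flow-outs per region and time interval.

    traces: iterable of traces; each trace is a list of
            (region, interval, is_station) records sorted by time.
    Returns two dicts mapping region -> list of counts per interval.
    """
    flow_in = defaultdict(lambda: [0] * num_intervals)
    flow_out = defaultdict(lambda: [0] * num_intervals)
    for trace in traces:
        i = 0
        while i < len(trace):
            region = trace[i][0]
            j = i
            # One visit = maximal run of records in the same region.
            while j + 1 < len(trace) and trace[j + 1][0] == region:
                j += 1
            station_times = [t for _, t, is_st in trace[i:j + 1] if is_st]
            if station_times:
                # First station of the visit counts as a flow-in,
                # last station of the visit counts as a flow-out.
                flow_in[region][station_times[0]] += 1
                flow_out[region][station_times[-1]] += 1
            i = j + 1
    return flow_in, flow_out
```

Regions that a user only passes through contain no station and therefore contribute nothing to either count. For case (2), the same counters could be updated directly from the pick-up and drop-off signals.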

For each region in the map, after computing the human flow (flow-in and flow-out), we can normalize them by the following equations:

(1)

We define the semantic similarity based on the Kullback-Leibler divergence (KLD), which measures how one probability distribution diverges from another. We use to represent the KLD between distributions and , and to represent the symmetric KLD:

(2)
(3)
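As an illustration of the normalization and the symmetric divergence, consider the following sketch. Two points are our assumptions rather than the paper's definitions: the counts are smoothed with a small constant `eps` so that the divergence stays finite when an interval has no flow, and the symmetric KLD is taken as the sum of the two directed divergences.

```python
import math

def normalize(counts, eps=1e-9):
    """Turn per-interval counts into a probability distribution.

    eps is an assumed smoothing constant that keeps every entry
    positive so that the KL divergence stays finite.
    """
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kld(p, q):
    """Kullback-Leibler divergence of p from q (natural logarithm)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetric_kld(p, q):
    """Assumed symmetrization: sum of the two directed divergences."""
    return kld(p, q) + kld(q, p)
```

Two regions whose flow-ins peak in the same intervals obtain a small symmetric divergence, regardless of how far apart they are geographically.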

There are two distributions for every region: Distributions of flow-ins and flow-outs. For regions and , we let ( ) and () denote the distributions of flow-in (flow-out) of them, respectively. The semantic distance between and is:

=

We can compute the semantic similarity between and by:

where and (with ) represent the importance indexes of flow-in and flow-out, respectively, and denotes a normalization constant:

By computing the similarities between every two regions, we can build a graph . Its vertices are regions, and the weight of an edge represents the semantic similarity between its two vertices. We run the hierarchical clustering algorithm on to group regions into distinct classes. Regions that fall into the same class have similar distributions of human flow, regardless of their geographical distance from each other. In other words, the probabilities that people visit and leave them in a specific time interval are close, so the classes represent the semantic features of regions. For example, if there are two residential areas, people are more likely to leave these areas in the morning and visit them in the evening. Thus we consider these regions semantically equivalent.

The process of generating regional semantic features is depicted in Algorithm 1.

Algorithm 1 Extracting Semantic Features of Regions
Input: Stations of seed traces
Output: Clusters of regions
Initialize a weighted graph with regions as vertices
for each region do
   Compute distribution and
end
for each region do
   for each region do
      Compute the semantic distance between and , as defined in Section 4.2
      Compute the edge weight between and as their semantic similarity, as defined in Section 4.2
   end
end
Return: Cluster regions by implementing the hierarchical clustering algorithm on
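The clustering step of Algorithm 1 might look like the following sketch. The paper does not state the linkage criterion or the stopping rule in this section, so average linkage and a target number of clusters `num_clusters` are assumptions; the semantic distance is passed in as a function `dist`, which is a hypothetical interface.

```python
def hierarchical_clusters(regions, dist, num_clusters):
    """Average-linkage agglomerative clustering (assumed variant).

    regions: list of region ids.
    dist: symmetric function (a, b) -> semantic distance.
    num_clusters: assumed stopping criterion (target number of classes).
    """
    clusters = [[r] for r in regions]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > max(num_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

This naive version recomputes all linkages in every round and is meant only to show the logic; a library implementation would be preferable for large maps.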

4.3 Modeling Mobility Pattern

We build the model of users’ mobility pattern in this subsection. It is a location-dependent first-order Markov chain [29] on the set of regions. Because the mobility patterns of individuals vary with time, we assume that a day is partitioned into several time intervals, and denote the number of time intervals by . The mobility profile is of a given time , which will be elaborated later.

In the time interval , is a weighted directed graph that depicts the transition probabilities between regions. It is built on the Markov chain assumption, which means that the position of a user depends only on the previous position in her trace. The vertices of are the regions in the map, and the weight of the edge from to is , where is the probability that users will move to region from region in the time interval . If is not adjacent to , equals . The reason for setting as the weight is that finding the trace with the -th maximum probability can then be transformed into finding the -th shortest path in .
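A minimal sketch of building such a graph for one time interval is given below. The edge weight is taken as the negative logarithm of the empirical transition probability; this is an assumption on our part, chosen because it turns a product of probabilities into a sum of weights, so that more probable traces correspond to shorter paths. The function name and input layout are hypothetical.

```python
import math
from collections import defaultdict

def transition_graph(sections):
    """Build edge weights -log(p) from observed region sequences.

    sections: iterable of region sequences observed in one time interval.
    Assumption: weight = -log(transition probability), so that the
    most probable trace becomes the shortest path.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sections:
        for a, b in zip(seq, seq[1:]):
            if a != b:  # ignore records that stay in the same region
                counts[a][b] += 1
    graph = {}
    for a, outs in counts.items():
        total = sum(outs.values())
        graph[a] = {b: -math.log(n / total) for b, n in outs.items()}
    return graph
```

Region pairs that never appear consecutively simply have no edge, which plays the role of the non-adjacent case.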

For a given time , is a three-dimensional tensor that represents the run time of passing through regions. The entry is the run time of travelling across region from to in this time interval. For example, regions and are adjacent to . In time , a user leaves region at time and enters . Then he reaches region at time . We record the entry as:

(4)

However, if region is not adjacent to region , it holds that

where is the set of all regions.
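The population-level averaging of run times can be sketched as follows. Instead of a dense three-dimensional tensor, the sketch stores only observed entries in a dictionary keyed by (previous region, crossed region, next region); this sparse layout and the input format, one (region, enter_time) pair per region change, are illustrative assumptions.

```python
from collections import defaultdict

def run_time_tensor(visits_per_user):
    """Average time to cross `cur` when coming from `prev` and going to `nxt`.

    visits_per_user: list of visit lists; each visit is (region, enter_time),
    with one entry per region change, all from the same time interval.
    Returns a sparse dict keyed by (prev, cur, nxt) instead of a dense tensor.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for visits in visits_per_user:
        triples = zip(visits, visits[1:], visits[2:])
        for (prev, _), (cur, t_in), (nxt, t_out) in triples:
            key = (prev, cur, nxt)
            # Crossing time = time of entering nxt minus time of entering cur.
            sums[key] += t_out - t_in
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Because every entry is an average over all users, no single user's travelling time can be read off the model.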

When users travel between two stations, they usually choose the way with the minimum cost (time or fuel consumption). Between the same starting and destination locations, in the same interval, users usually choose similar paths, and their travelling times are similar too. We model the mobility pattern in a population-level way, where and are averaged by taking all users into account. The advantages can be summarized as follows:

(1) The impostor trace behaves like an ordinary user’s trace, and it does not reveal the characteristics of any individual client.

(2) It can protect the individual private trajectories in data processing.

The detailed process is given in Algorithm 2.

Algorithm 2 Modeling Users’ Mobility in Different Intervals
Input: Sections between stations of seed traces
Output: Mobility profile for each time interval
for each time interval do
   Initialize a weighted directed graph with regions as vertices
   for each region do
      for each region do
         Compute the edge weight from the transition probability, as defined in Section 4.3
      end
   end
   Initialize a three-dimensional tensor
   for each region do
      for each region do
         for each region do
            Set the tensor entry to the average time of passing through from to
         end
      end
   end
end

5 Online Generator for Impostors

In this section, we present the details of our method for synthesizing impostor traces based on the actual location and the offline module built previously. It is depicted in Algorithm 3. When a user issues a query about a record to the LBS server, her record is uploaded to the online generator in the TTP. The generator then extracts the part of this user’s real trace that contains the target record, and synthesizes impostor traces from it. Since the query concerns a record rather than a trace, after generating a collection of impostor traces, the fake records are obtained as follows:

We first determine the time interval of the actual record (reported to the LBS). Then we select, from the impostor traces, the records in the same time interval as the fake ones.
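This selection step is simple enough to state in a few lines. The sketch below assumes that each impostor trace is a list of (region, interval) records and that at most one record per trace is taken; the function name is hypothetical.

```python
def fake_records_at(impostor_traces, target_interval):
    """Pick, from each impostor trace, the record in the target interval.

    impostor_traces: list of traces, each a list of (region, interval).
    Traces with no record in that interval are skipped.
    """
    fakes = []
    for trace in impostor_traces:
        for region, interval in trace:
            if interval == target_interval:
                fakes.append((region, interval))
                break  # one fake record per impostor trace
    return fakes
```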

We synthesize impostor traces by utilizing the stations instead of all of the records in the complete trace. Reasons are as follows:

(1) Stations represent the semantic features of activities.

(2) Using stations rather than all of the records places fewer restrictions on the input data.

(3) We can synthesize reasonable traces with low complexity.

For users’ traces, we divide a day into intervals. It indicates that the locations in the trace are sampled every hours.

Input:  (1) A real record; (2) the trajectory around the real record; (3) the fake records generated before; (4) the number of impostor traces ()
Output:  Impostor traces
1:  Locate real start station , and real end station
2:  for each station in the real trace from the start station to the end station do
3:     if the station has fake records generated before then
4:        The set of candidate fake records ← {fake records generated before}
5:     else
6:        Choose a fake location from the semantic class of the station, excluding its real location
7:        The set of candidate fake records ← {records at the fake location at the time of the station}
8:     end
9:  end
10:  for every two successive stations in the real trace from the start station to the end station do
11:     for each candidate fake location of the first station do
12:        for each candidate fake location of the second station do
13:           Compute the geographical distance between the central points of the two candidate locations
14:           Add an edge between the two candidate locations, with its weight computed from this distance
15:        end
16:     end
17:     Match the candidate fake records of the two stations in the bipartite graph by the Kuhn-Munkres algorithm
18:     Choose the pairs of fake records whose geographical distances are most similar to the real distance
19:     for every chosen pair of fake records of the two stations do
20:        Determine the time interval
21:        Impostor trace ← the -th shortest path between the pair of fake records in the graph
22:        for each record in the impostor trace do
23:           Compute the estimated time of reaching its region
24:           Compute the cloaking time interval of the record
25:        end
26:     end
27:  end
Algorithm 3 Synthesizing Plausible Traces

5.1 Transforming Stations into Fake Ones

In this subsection, we generate fake stations based on the target records and side information. As illustrated in Section 3, the real trace and the fake records generated before are stored in the user’s mobile device. A user issues a query about the record in the -th time interval and uploads the relevant information to the TTP. The information consists of two parts. The first part is the user’s trajectory during the period from hours before the time of the target record to hours after it. The second part contains the fake records of stations generated before. We divide stations into two categories: special ones and ordinary ones. If fake records of a station have already been generated in a previous round of synthesizing impostors, we regard it as a special station. Otherwise, it is an ordinary station.

The process of transforming stations into fake ones begins with extracting a part of the real trace. This partial trace is the user’s real trajectory from the start station to the end station, and it serves as the template for the impostor traces synthesized in our framework. There are two rules for selecting the start and end stations: (1) In the trajectory from the side information, if there are special stations before (after) the target record, we select the special one nearest in time to the target record as the start (end) station. (2) If there is no special station before (after) the target record, we select the ordinary one nearest in time to the target record as the start (end) station. Note that if there is no station after the target record, we regard the target record itself as a station.

After receiving the target record and side information, the TTP first identifies stations in the trajectory (from the side information) using the definition in Section 4.1. Then the TTP selects the start and end stations to extract the partial real trace. As depicted in Fig. 5(a), depending on the states of the stations, there are four combinations of the start and end stations.

For a special station, the candidate fake records are the same as those generated before. For an ordinary station, to keep the semantic similarity between the fake and actual stations, the fake station should be drawn from the same semantic class as the actual one. For any station in a trace, its location corresponds to the regional semantic class . To prevent leakage of the original information, we remove from before picking the fake station. So, we choose the fake location . In this case, the candidate fake stations are the records

{, time of }.
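The candidate selection for an ordinary station can be illustrated with a short sketch. The function name `candidate_fake_records` and the data layout (a set of locations per semantic class, and seed records as `(location, time)` pairs) are assumptions made for illustration; this sketch keeps every location of the class except the real one, whereas the method may pick a single fake location first.

```python
def candidate_fake_records(real_location, real_time, semantic_class, records):
    """Return candidate fake records for an ordinary station.

    real_location:  location (e.g. grid id) of the real station
    real_time:      time interval of the real station
    semantic_class: iterable of locations in the same semantic class
                    as real_location (assumed to be precomputed offline)
    records:        iterable of (location, time) seed records
    """
    # Remove the real location so the original information cannot leak.
    allowed = set(semantic_class) - {real_location}
    # Keep records at an allowed location whose time matches the station.
    return [(loc, t) for loc, t in records if loc in allowed and t == real_time]
```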

(a) Four combinations of start and end stations: (1) Special start station, and special end station. (2) Special start station, but ordinary end station. (3) Ordinary start station, but special end station. (4) Ordinary start station, and ordinary end station.
(b) An example of transforming stations into fake ones
Fig. 5: The sketch of transforming stations into fake ones.

Note that there may be other stations in the trace between the start and end stations. We synthesize fake sub-traces between every two consecutive stations. Each station in this trace has its corresponding fake candidates, from which its fake stations are selected. The selection is not purely random. It follows the rule that the distance between two consecutive candidate fake stations should be consistent with the travel time in the real trace. Otherwise, attackers can easily identify the fake stations. For example, if two fake stations are apart, but the travel time between their corresponding real stations is just minutes, the pair is obviously unreasonable. Intuitively, we intend to ensure that pairs of fake stations are semantically and geographically similar to the actual pairs.

To keep the utility of the service, we partition the map into small grids, e.g., regular . We define the distance between two regions as the straight-line distance between their central points. For every two consecutive stations in the real trace, the first one is the start station , and the second is the end station . We compute the distance between every two regions drawn from two sets: one is the set of locations of the fake candidates of , and the other is the set of locations of the fake candidates of . We then build a bipartite graph whose vertices are divided into two disjoint and independent sets and , such that every edge connects one vertex in to one in . The weight of the edge connecting and is calculated by:

where represents the geographical distance. After building the graph, we utilize the Kuhn-Munkres algorithm to match the regions from to those from so that the sum of weights is maximized [30]. To generate impostor traces, we choose pairs of regions whose geographical distances are most similar to the real one. Note that an impostor trace should have the same number of stations as the real trace. For example, if , the sketch of transforming stations into fake ones is shown in Fig. 5(b). After fake records are generated for an end station, they become its candidate set, which is then used to generate fake records for the next station. This approach keeps the geographical similarity between impostors and real traces.
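The matching step can be illustrated with a small sketch. Two assumptions are made loudly here: the edge weight is taken to be the negative absolute difference between a candidate pair's distance and the real distance (the weight formula is not reproduced above), and the maximum-weight matching is found by exhaustive search rather than by the Kuhn-Munkres algorithm, which reaches the same optimum in polynomial time. The function name `match_candidates` and the planar `(x, y)` centers are hypothetical.

```python
from itertools import permutations
from math import hypot


def match_candidates(starts, ends, real_distance, n_pairs):
    """Match start candidates to end candidates; return the n_pairs best pairs.

    starts, ends: dict mapping a region id to the (x, y) center of the region.
    The weight -|d(u, v) - real_distance| is an assumed form for illustration.
    """
    if len(starts) > len(ends):
        raise ValueError("this sketch expects len(starts) <= len(ends)")

    def dist(u, v):
        (x1, y1), (x2, y2) = starts[u], ends[v]
        return hypot(x1 - x2, y1 - y2)

    def weight(u, v):
        return -abs(dist(u, v) - real_distance)

    start_ids = sorted(starts)
    # Exhaustive search over all assignments (only viable for small sets).
    best = max(
        permutations(sorted(ends), len(start_ids)),
        key=lambda perm: sum(weight(u, v) for u, v in zip(start_ids, perm)),
    )
    matched = list(zip(start_ids, best))
    # Keep the pairs whose distance is closest to the real distance.
    matched.sort(key=lambda pair: abs(dist(*pair) - real_distance))
    return matched[:n_pairs]
```

For realistic candidate sets, a polynomial-time assignment solver would replace the exhaustive search; the selection of the closest pairs afterwards stays the same.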

5.2 Complementing Impostor Trace

A trace is a sequence of records. So, in this subsection, we fill in the trace between fake stations. For a section from the start station to the end station, any random walk may appear to be logical. However, a vehicle traveling in the wrong direction of a road is easily detected. We therefore fill in the section based on the mobility model in a specific time interval (). For a pair of the start location and end location , our goal is to find the trace () with the largest probability. This step can be formulated as:

Since the value of is constant for a specific start location, the formula above can be simplified as:

(5)

However, consider a strong attacker who can construct the knowledge of the user’s mobility in advance. If the impostor traces generated by our model are always based on the maximum transition probability, the attacker can filter them out easily, because the probability of real traces is not always the maximum. So, we choose the trace with the -th greatest probability as the fake one. To increase the robustness of our algorithm, we add some randomness to the selection of the value of . The probability of is dependent on the statistics of the real data.

In our weighted directed graph of regions, given the start and end regions, we can select the trace with the -th greatest probability by computing the -th shortest trace between the two regions in the graph . We find the -th shortest path based on Dijkstra’s algorithm and algorithm [31]. The weight of the edge is , so that the distance in the graph of a trace () is

The -th shortest trace is the trace with the -th greatest probability.
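The correspondence between the most probable traces and the shortest paths can be illustrated as follows. The sketch assumes that each edge weight is the negative logarithm of the transition probability, so that a larger product of probabilities means a smaller sum of weights; this weight form is an assumption, since the formula is not reproduced above. It also uses a simple best-first enumeration of loop-free paths instead of the algorithm of [31], which is far more efficient on large graphs. The function name and the rank parameter `k` are hypothetical.

```python
import heapq
from math import exp, log


def kth_most_probable_path(prob, start, end, k):
    """Return (path, probability) of the k-th most probable loop-free path.

    prob: dict region -> dict next_region -> transition probability in (0, 1].
    Edge weight is assumed to be -log(probability), so costs are non-negative
    and paths are popped from the heap in order of decreasing probability.
    Returns None if fewer than k paths exist.
    """
    heap = [(0.0, [start])]
    found = 0
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == end:
            found += 1
            if found == k:
                return path, exp(-cost)
            continue
        for nxt, p in prob.get(node, {}).items():
            if p > 0 and nxt not in path:  # skip impossible moves and loops
                heapq.heappush(heap, (cost - log(p), path + [nxt]))
    return None
```

The enumeration can grow exponentially with the graph size, so it is meant only to make the probability-to-distance mapping concrete.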

5.3 Adding Timestamps

After generating the impostor trace (), we add timestamps to each region in the trace. As illustrated in Fig. 6, the start time is and the end time is according to the real trace. We assign the time interval as the interval including the time . The travel time through every region along the impostor trace is the statistical time multiplied by . We compute and the estimated time to arrive at region by the following procedure:

The estimated time is accurate to the second, so we transform it into an interval as required by LBS providers. We can compute the cloaking time interval of region in the impostor trace as .
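A minimal sketch of the timestamping step is given below. It assumes that the scaling factor is the ratio between the real travel duration and the sum of the statistical passing times, and that the cloaking interval is obtained by integer division of the arrival time by the interval length; both forms are assumptions, as the corresponding formulas are not reproduced above. The function and parameter names are hypothetical.

```python
def add_timestamps(trace, stat_time, t_start, t_end, interval_seconds):
    """Attach arrival times and cloaked intervals to an impostor trace.

    trace:            list of regions along the impostor trace
    stat_time:        statistical passing time (seconds) of every region
                      except the last one, so len(stat_time) == len(trace) - 1
    t_start, t_end:   start and end times (seconds) taken from the real trace
    interval_seconds: length of one cloaking interval in seconds
    Returns a list of (region, arrival_time, interval_index).
    """
    if len(stat_time) != len(trace) - 1:
        raise ValueError("need one passing time per region except the last")
    # Assumed scaling: stretch statistical times to fit the real duration.
    scale = (t_end - t_start) / sum(stat_time)
    arrivals = [t_start]
    for dt in stat_time:
        arrivals.append(arrivals[-1] + scale * dt)
    return [
        (region, t, int(t // interval_seconds))
        for region, t in zip(trace, arrivals)
    ]
```

With this scaling, the impostor trace starts and ends at the same times as the real section, which keeps the two consistent for an observer.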

6 Evaluation

6.1 Experiment Setup

(a) Dataset in Shanghai
(b) Dataset in Asturias
Fig. 7: The Change of Time Consumption and Transition Probability with Time Growing.

In this section, we evaluate our method on real traces from two datasets. One is the dataset of taxi traces in Shanghai, China [32], and the other is the dataset of private car traces in Asturias, Spain [33]. The vehicular trajectory data in Shanghai were collected from taxis from April 1st, 2015 to April 30th, 2015. They were recorded every few seconds in days. The raw dataset of Shanghai is a series of records, each of which contains the latitude, longitude, time and a signal indicating whether the taxi is carrying passengers. The original data in Asturias are the GPS traces for one year collected from cars, and records are reported with an interval of seconds. The Asturias dataset is a series of records containing the latitude, longitude, and time. The experiments are conducted on maps of different scales in Shanghai and Asturias.

We first preprocess the datasets by dividing the map into regular grids and dividing a day into several time intervals (). To choose an appropriate , we compute statistics of the time consumption when passing through a grid, and of the transition probability between grids at different times for the two datasets. In every hour, we randomly choose grids, and calculate the average time consumption when passing through them. We also randomly choose pairs of grids to calculate the average transition probability between them. As depicted in Fig. 7, the transition probability of a given dataset fluctuates around a constant value. However, the time consumption of traveling through a grid varies from hour to hour. To keep the accuracy, we build the mobility model for every hour in a day ().

For the dataset of taxis in Shanghai, a station is a record at which the signal indicating whether the taxi is carrying passengers changes. If the signal of a record changes from to , a passenger gets on at this time, so a passenger leaves this location. Conversely, when a passenger gets off, he visits this location. In the experiment, we extract stations from the dataset.
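The extraction of stations from the taxi records can be sketched as follows. The record layout `(lat, lon, time, occupied)` and the encoding of the passenger signal as 0 (vacant) and 1 (occupied) are assumptions made for illustration, as are the function name and the labels `"pickup"` and `"dropoff"`.

```python
def extract_taxi_stations(records):
    """Find stations where the passenger signal changes.

    records: list of (lat, lon, time, occupied) tuples of one taxi, sorted by
             time; occupied is assumed to be 0 (vacant) or 1 (carrying).
    Returns a list of (lat, lon, time, kind), where kind is "pickup" when a
    passenger gets on (leaves the location) and "dropoff" when a passenger
    gets off (visits the location).
    """
    stations = []
    for prev, cur in zip(records, records[1:]):
        if prev[3] != cur[3]:
            kind = "pickup" if cur[3] == 1 else "dropoff"
            stations.append((cur[0], cur[1], cur[2], kind))
    return stations
```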

For the dataset of private cars in Asturias, we assume that if a vehicle stays in a grid longer than a threshold, the grid is a station in the trajectory. Note that the thresholds vary across grids and time intervals. We denote the average speed of region in time interval as , and set the threshold in this region and time to

The first and last records in this grid are stations, which means that a user visits and leaves this location, respectively. We extract stations from this dataset.

In the step of filling the path based on , we choose the -th shortest path. The probability of depends on the statistics of the real data. We count the transition probability rank of traces between stations. The relations between the rank () and the proportion in the two datasets are shown in Fig. 8.

Fig. 8: The relation between rank (k) and proportion.
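Drawing the rank at random according to the observed proportions can be sketched as below. The function name, the dictionary layout, and the proportion values in the usage note are hypothetical and are not the measured values behind Fig. 8.

```python
import random


def sample_rank(proportions, rng=random):
    """Draw a rank k with probability proportional to its observed share.

    proportions: dict mapping rank k -> share of real traces having that rank
                 (the shares need not sum to exactly one).
    rng:         the random module or a random.Random instance.
    """
    ranks = list(proportions)
    weights = [proportions[k] for k in ranks]
    return rng.choices(ranks, weights=weights, k=1)[0]
```

For instance, `sample_rank({1: 0.5, 2: 0.3, 3: 0.2}, random.Random(7))` returns one of the ranks 1, 2, or 3, with lower ranks drawn more often under these made-up shares.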

The experiments are implemented in Python. The computer used in the experiments has an Intel(R) Xeon(R) E- CPU at GHz, GB of RAM, TB of disk space, and Windows Server R Enterprise.

6.2 Efficacy

(a) Dataset in Shanghai
(b) Dataset in Asturias
Fig. 9: The Relations of Efficacy, and .

In this subsection, we measure how effectively our method works against a general inference attack [10]. Specifically, given the user’s observed traces and the basic knowledge needed, the state-of-the-art attack model runs the inference attack to filter out the real traces. This attack infers the actual location of a user at each time in a trace. Given the data of users’ mobility and a set of observed traces, it constructs the knowledge of users, and infers the real trace based on a Hidden Markov Model. The preservation efficacy of a model is defined as the probability that the attacker fails to infer the real trace. The higher this probability, the more effective the protection mechanism. Consider an example. The TTP sends a real trace of Alice together with impostor traces to the LBS provider, and the adversary Bob has access to the LBS queries. He tries to infer the real trace from the blended data by a strong inference attack. If he finds real traces from groups of blended data, the accuracy rate of the adversary is , i.e., the preservation efficacy is .
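The definition of preservation efficacy amounts to one minus the adversary's accuracy, which can be written as a short sketch. The function name and the counts in the usage note are hypothetical and do not reproduce the experimental values.

```python
def preservation_efficacy(num_identified, num_groups):
    """Probability that the adversary fails to identify the real trace.

    num_identified: number of groups in which the real trace was found
    num_groups:     total number of groups of blended (real + impostor) traces
    """
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    if not 0 <= num_identified <= num_groups:
        raise ValueError("num_identified must lie in [0, num_groups]")
    accuracy = num_identified / num_groups
    return 1.0 - accuracy
```

With made-up counts, an adversary who identifies the real trace in 25 out of 100 groups has an accuracy of 0.25, so `preservation_efficacy(25, 100)` gives 0.75.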

In the experiment for testing the efficacy, we adopt fine-grained traces. We divide a day into intervals (), and input seed traces to build the offline generator. For the datasets in Shanghai and Asturias, the average lengths of seed traces are and intervals, respectively. The maps used in the experiment are the central areas of Shanghai and Asturias, both of which contain grids.

To choose the most proper parameters , () and when generating regional semantic features, we record the relations among efficacy, and in Fig. 9 when synthesizing impostors for each trace. According to the result, we set the importance index , and .

After the process of computing semantic similarity among regions by Algorithm 1, we define the threshold in the Hierarchical Clustering algorithm as , and cluster grids based on the similarity graph. Then we resample traces as the input to synthesize impostor traces by Algorithm 3.

We generate different numbers of impostor traces (, , , per real trace) through different synthetic generation techniques:

(1) IID-Population [10]: Generating fake locations independently and identically distributed based on the mobility in a population level.

(2) RW-Population [34]: Generating fake locations from a random walk based on the mobility in a population level.

(3) RW-Individual [35]: Generating the path between locations based on the mobility in an individual level.

(4) SG-Lppm [7]: Generating fake traces of the real trace by the semantic features of regions. It extracts the semantic features by matching every location of every two users. In the process of generating fake traces, it replaces all locations in a user’s real trace with ones in the same semantic cluster, and generates traces by a Hidden Markov Model.

(a) Dataset in Shanghai
(b) Dataset in Asturias
Fig. 10: Efficacy of Different Generation Techniques.

After that, we input the observed traces which contain the real and impostor traces to the attack model [10], and compute the efficacy of different methods.

Fig. 10 depicts the efficacy of these methods. The result shows that our method and SG-Lppm perform much better than the others on these two datasets, achieving an efficacy surpassing for different numbers of observed traces. When uploading observed traces for each real trace, our approach achieves preservation efficacies of and on the datasets of Shanghai and Asturias, respectively, compared with and by SG-Lppm. Although SG-Lppm performs slightly better than ours, the gap is acceptable. The methods IID-Population, RW-Population, and RW-Individual do not consider the semantic features of traces, and have randomness in generating impostor traces. The state-of-the-art protection SG-Lppm synthesizes traces based on the regional semantics, which are extracted from the time and space features of human mobility. It builds a straightforward but high-cost module by considering all locations in individual traces. Compared with SG-Lppm, our population-level model neglects the latent spatiotemporal correlations among visited locations, which inevitably degrades the extraction of semantic features. In return, we achieve low computational complexity while avoiding non-negligible efficacy degradation. The scalability results are shown in the next subsection.

6.3 Scalability

According to the results of the efficacy evaluation, our method and SG-Lppm perform much better than the others. So, in this subsection, we compare the scalability of these top two methods in two aspects: memory consumption and time consumption.

6.3.1 Memory Consumption

We evaluate the memory consumption of our method and SG-Lppm. When establishing the generator of impostor traces, we store the necessary data in memory, namely the semantic similarity of grids and the users’ mobility model. The space complexity of ours is , compared with for SG-Lppm, where is the number of input traces, and is the number of regions.

In the experiment, we measure the memory consumption on maps of different scales (, , , and grids with an area of ). We set , and input seed traces corresponding to different scales of maps to build generators. Fig. 11 shows that the memory consumption and growth rate of ours are much lower than those of SG-Lppm. On the map of grids in Shanghai city and Asturias, the consumptions of ours are MB and MB, respectively, while those of SG-Lppm are MB and MB, respectively.

Fig. 11: Memory Consumption of Building the Module.

6.3.2 Execution Time

We compare the run time of building the offline generator between ours and SG-Lppm. The time complexity of ours is , compared with their , where is the number of seed traces (usually a large number), is the number of regions in the map, and is the average length of the traces (). Clearly, building the impostor generator requires a large number of input traces. In this evaluation, we compare the execution time on scales of maps (, , , and grids), and input different numbers of seed traces for each map when . The scale of the input data is the number of locations in the seed traces (). The relations between the scale of data and time consumption under different scales of maps are shown in Fig. 12. On the map of grids, the maximal scales of the datasets in Shanghai and Asturias are about million and million, respectively. The time consumptions are seconds and seconds, respectively. Under these conditions, we executed SG-Lppm for two weeks on the server without obtaining any output. This is why Fig. 12 shows only the execution time of building the synthesis model of our method. In contrast, our method obtains preservation efficacies of and under this condition on the two datasets, respectively.

(a) Dataset in Shanghai
(b) Dataset in Asturias
Fig. 12: Run Time of Building the Module.
 Map Scales
   Shanghai ms ms ms ms
   Asturias ms ms ms ms
TABLE II: Run time of generating one impostor trace in two datasets

Moreover, we test the time consumption of generating traces through the online impostor generator on different scales of maps: , , , and grids (the size of grids is ). We input real traces, and generate impostor traces for each input. We record the time consumption of synthesizing one impostor trace on different scales of maps in Table II. SG-Lppm is not scalable to large-scale maps and massive seed traces, so we only implement SG-Lppm on the map of grids. Its run time of synthesizing one impostor trace is and seconds on the datasets of Shanghai and Asturias, respectively, much larger than our time of ms and ms.

SG-Lppm [7] analyzes the transfer probabilities among visited locations along traces of different users, and forges a fake location corresponding to every visited location when synthesizing fake traces. Such a refined method brings high preservation efficacy. However, such an individual-level approach incurs very high time complexity.

7 Conclusion

We design a scalable and high-quality method for location privacy preservation based on the paradigm of synthesizing impostor traces. Two dedicated techniques are devised: the population-level semantic model and the sampling-based synthesis method. Combining these two techniques, our method achieves high preservation efficacy with low computational complexity. Our method is shown to be applicable to problems of different sizes. We validate its scalability in terms of both memory consumption and execution time.

References

  • [1] D. Eckhoff and C. Sommer, “Driving for big data? privacy concerns in vehicular networking,” Proc. IEEE S&P, 2014, vol. 12, no. 1, pp. 77–79, 2014.
  • [2] A. R. Beresford and F. Stajano, “Location privacy in pervasive computing,” IEEE Pervasive computing, vol. 2, no. 1, pp. 46–55, 2003.
  • [3] B. Lee, J. Oh, H. Yu, and J. Kim, “Protecting location privacy using location semantics,” in Proc. ACM SIGKDD 2011.
  • [4] L. Yao, G. Wu, J. Wang, F. Xia, C. Lin, and G. Wang, “A clustering k-anonymity scheme for location privacy preservation,” IEICE Transactions on Information and Systems, vol. 95, no. 1, pp. 134–142, 2012.
  • [5] A. Khoshgozaran and C. Shahabi, “Blind evaluation of nearest neighbor queries using space transformation to preserve location privacy,” in Proc. SSTD 2007.
  • [6] M. L. Yiu, C. S. Jensen, X. Huang, and H. Lu, “Spacetwist: Managing the trade-offs among location privacy, query performance, and query accuracy in mobile services,” in Proc. IEEE ICDE 2008.
  • [7] V. Bindschaedler and R. Shokri, “Synthesizing plausible privacy-preserving location traces,” in Proc. IEEE S&P 2016.
  • [8] J. Krumm, “Realistic driving trips for location privacy,” in Proc. 7th International Conference on Pervasive Computing (Pervasive 2009).
  • [9] T.-H. You, W.-C. Peng, and W.-C. Lee, “Protecting moving trajectories with dummies,” in Proc. IEEE MDM 2007.
  • [10] R. Shokri, G. Theodorakopoulos, J.-Y. Le Boudec, and J.-P. Hubaux, “Quantifying location privacy,” in Proc. IEEE S&P 2011.
  • [11] R. Cheng, Y. Zhang, E. Bertino, and S. Prabhakar, “Preserving user location privacy in mobile data management infrastructures,” in International Conference on Privacy Enhancing Technologies.
  • [12] C. A. Ardagna, M. Cremonini, S. D. C. D. Vimercati, and P. Samarati, “An obfuscation-based approach for protecting location privacy,” IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 1, pp. 13–27, 2010.
  • [13] Y. Dou, K. C. Zeng, H. Li, Y. Yang, B. Gao, C. Guan, K. Ren, and S. Li, “P-sas: preserving users’ privacy in centralized dynamic spectrum access systems,” in Proc. ACM MobiHoc 2016.
  • [14] W. M. Liu, L. Wang, P. Cheng, K. Ren, S. Zhu, and M. Debbabi, “PPTP: privacy-preserving traffic padding in web-based applications,” IEEE Transactions on Dependable and Secure Computing, vol. 11, no. 6, pp. 538–552, 2014.
  • [15] B. Liu, W. Zhou, T. Zhu, H. Zhou, and X. Lin, “Invisible hand: A privacy preserving mobile crowd sensing framework based on economic models,” IEEE Transactions on Vehicular Technology, vol. 66, no. 5, pp. 4410–4423, 2017.
  • [16] K. P. N. Puttaswamy, S. Wang, T. Steinbauer, D. Agrawal, A. E. Abbadi, C. Kruegel, and B. Y. Zhao, “Preserving location privacy in geosocial applications,” IEEE Transactions on Mobile Computing, vol. 13, no. 1, pp. 159–173, 2014.
  • [17] R. Shokri, G. Theodorakopoulos, P. Papadimitratos, E. Kazemi, and J. P. Hubaux, “Hiding in the mobile crowd: Location privacy through collaboration,” IEEE Transactions on Dependable and Secure Computing, vol. 11, no. 3, pp. 266–279, 2014.
  • [18] Y. Zheng, M. Li, W. Lou, and Y. T. Hou, “Location based handshake and private proximity test with location tags,” IEEE Transactions on Dependable and Secure Computing, vol. PP, no. 99, pp. 1–1, 2017.
  • [19] D. Henrici and P. Muller, “Hash-based enhancement of location privacy for radio-frequency identification devices using varying identifiers,” in Proc. IEEE PerCom 2004.
  • [20] B. Hoh and M. Gruteser, “Protecting location privacy through path confusion.”
  • [21] M. S. Kirkpatrick, G. Ghinita, and E. Bertino, “Privacy-preserving enforcement of spatially aware rbac,” IEEE Transactions on Dependable and Secure Computing, vol. 9, no. 5, pp. 627–640, 2012.
  • [22] T. Hara, Y. Arase, A. Yamamoto, X. Xie, M. Iwata, and S. Nishio, “Location anonymization using real car trace data for location based services,” in Proc. ACM ICUIMC 2014.
  • [23] J. Freudiger, M. H. Manshaei, J. P. Hubaux, and D. C. Parkes, “Non-cooperative location privacy,” IEEE Transactions on Dependable and Secure Computing, vol. 10, no. 2, pp. 84–98, 2013.
  • [24] C. Y. Chow, M. F. Mokbel, and W. G. Aref, “Casper*: Query processing for location services without compromising privacy,” ACM Transactions on Database Systems, vol. 34, no. 4, pp. 1–48, 2009.
  • [25] V. Bindschaedler and R. Shokri, “Privacy through fake yet semantically real traces,” Computer Science, 2015.
  • [26] R. Shokri, “Privacy games: Optimal user-centric data obfuscation,” Proceedings on Privacy Enhancing Technologies, vol. 2015, no. 2, pp. 299–315, 2015.
  • [27] J. Yuan, Y. Zheng, and X. Xie, “Discovering regions of different functions in a city using human mobility and pois,” in Proc. ACM SIGKDD 2012.
  • [28] T. M. T. Do and D. Gatica-Perez, “The places of our lives: Visiting patterns and automatic labeling from longitudinal smartphone data,” IEEE Transactions on Mobile Computing, vol. 13, no. 3, pp. 638–648, 2014.
  • [29] J. Krumm, “A markov model for driver turn prediction,” Sae World Congress, vol. 22, no. 1, pp. 1–25, 2008.
  • [30] H. Zhu, M. C. Zhou, and R. Alkins, “Group role assignment via a Kuhn–Munkres algorithm-based solution,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 42, no. 3, pp. 739–750, 2012.
  • [31] D. Eppstein, “Finding the k shortest paths,” in Foundations of Computer Science, 1994 Proceedings., Symposium on.
  • [32] “Taxi dataset in shanghai,” http://www.datashanghai.gov.cn.
  • [33] X. G. García, D. Melendi, S. Cabrero, and R. García, “CRAWDAD dataset oviedo/asturies-er (v. 2016-08-08),” Downloaded from https://crawdad.org/oviedo/asturies-er/20160808, Aug. 2016.
  • [34] R. Chow and P. Golle, “Faking contextual data for fun, profit, and privacy,” in Proc. ACM WPES 2009.
  • [35] R. Kato, M. Iwata, T. Hara, A. Suzuki, X. Xie, Y. Arase, and S. Nishio, “A dummy-based anonymization method based on user trajectory with pauses,” in Proc. ACM SIGSPATIAL GIS 2012.