Machine Learning Aided Anonymization of Spatiotemporal Trajectory Datasets
The big data era requires a growing number of companies to publish their data publicly. Preserving the privacy of users while publishing these data has become a critical problem. One of the most sensitive sources of data is spatiotemporal trajectory datasets, as users’ personal information such as home address, workplace and shopping habits can be inferred from them. In this paper, we propose an approach for anonymization of spatiotemporal trajectory datasets. The proposed approach is based on generalization entailing alignment and clustering of trajectories. We propose to apply the k'-means algorithm for clustering trajectories by developing a technique that makes it possible. We also significantly reduce the information loss during the alignment by incorporating multiple sequence alignment instead of the pairwise sequence alignment used in the literature. We analyze the performance of our proposed approach by applying it to the Geolife dataset, which includes GPS logs of over 180 users in Beijing, China. Our experiments indicate the robustness of our framework compared to prior works.
1 Introduction
Publishing data by different organizations and institutes is crucial for open research and transparency of government agencies. In Australia alone, over 7000 additional datasets have been published since 2013 on ’data.gov.au’, a dedicated website for data publication by the Australian government. Moreover, the new Australian government data sharing and release legislation [1] encourages government agencies to publish their data, and as early as 2018 many of them will have to do so. The process of data publication can be significantly risky, as the data may contain individuals’ sensitive information. Therefore, an essential step before publishing data is to remove any personally identifying information from the dataset. However, such an operation is not sufficient for privacy preservation. Adversaries are able to link datasets using common attributes called quasi-identifiers, or may have prior knowledge about the trajectories travelled by the users, which enables them to reveal sensitive information that can cause physical, financial and reputational harm to people.
One of the most sensitive sources of data is location trajectories or spatiotemporal trajectories. Despite the numerous use cases that the publication of spatiotemporal data can provide to users and researchers, it poses a significant threat to users’ privacy. As an example, consider a person who has been using GPS navigation to travel from home to work every weekday morning. If an adversary has some prior knowledge about a user, such as the home address, the adversary may be able to identify the user. This can compromise private information about the user, such as the user’s health condition and how often the user visits a specialist. Therefore, it is crucial to anonymize spatiotemporal datasets before publishing them.
Most of the work in the area of privacy preservation for spatiotemporal trajectories is focused on achieving k-anonymity, proposed in [2]. The idea is to hide the true data among at least k-1 other data entries so that the trajectories of the users are not distinguishable. The authors in [3] adopted the notion of k-anonymity for trajectories and proposed an anonymization algorithm based on generalization. Xu et al. [4] investigated factors such as spatio-temporal resolution and the number of users released. The authors in [5] focused on improving the clustering approach in the anonymization process. Their anonymization scheme is based on achieving k-anonymity by grouping similar trajectories and removing the ones that are highly dissimilar. More recently, the authors in [6] developed an algorithm called k-merge to anonymize the trajectory datasets while preserving the privacy of users from probabilistic attacks. Local suppression and splitting techniques were considered in [7].
In general, there are two major problems with the existing methods. First, the literature lacks well-defined methods to cluster trajectories, as there is no easy way to measure the cost of clustering according to distances among trajectories. Second, the existing literature focuses on pairwise sequence alignment, which results in a high amount of information loss during the trajectory alignment. In our work, we address these problems by proposing a method to anonymize spatiotemporal trajectories. Our approach has two main contributions:
Use of multiple sequence alignment instead of pairwise sequence alignment, which significantly reduces the cost of alignment.
Applying the k'-means clustering algorithm for clustering trajectories by developing a technique that enables its use.
We analyze the performance of our proposed anonymization method by applying it to the Geolife dataset, which includes GPS logs of over 180 users in Beijing, China. Our experiments indicate the robustness of our proposed method compared to recent work in .
2 System Model and Problem Formulation
We assume that the map has been discretized into a grid of cells and that time is discretized into bins of fixed length. Therefore, each point in the dataset represents a snapshot of a real-world location query including x-coordinate, y-coordinate, and time. In our model, we consider a spatiotemporal trajectory dataset consisting of a finite number of trajectories. Each trajectory is an ordered set of spatiotemporal 3D points, and each point is a triplet of x-coordinate, y-coordinate, and the time of query, respectively.
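The discretization described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the bounding box, the grid size of 8, the 600-second bin length, and the names Point and discretize are all assumptions chosen for the example.

```python
from typing import NamedTuple

class Point(NamedTuple):
    x: int  # grid column index
    y: int  # grid row index
    t: int  # time-bin index

def discretize(lon, lat, timestamp, bounds, grid_size, bin_seconds):
    """Map a raw GPS fix to a grid cell and a time bin.

    bounds = (min_lon, min_lat, max_lon, max_lat); every parameter value
    used below is hypothetical. Fixes outside the bounds are not handled.
    """
    min_lon, min_lat, max_lon, max_lat = bounds
    # Clamp to the last cell so a fix exactly on the upper boundary stays in range.
    x = min(int((lon - min_lon) / (max_lon - min_lon) * grid_size), grid_size - 1)
    y = min(int((lat - min_lat) / (max_lat - min_lat) * grid_size), grid_size - 1)
    t = int(timestamp // bin_seconds)
    return Point(x, y, t)

# A trajectory is an ordered list of (x, y, t) triplets.
BOUNDS = (116.0, 39.0, 117.0, 40.0)  # assumed bounding box
raw_fixes = [(116.30, 39.98, 0), (116.55, 39.50, 1250)]
trajectory = [discretize(lon, lat, ts, BOUNDS, 8, 600) for lon, lat, ts in raw_fixes]
```

Each resulting triplet is one leaf value per attribute, which is the representation the generalization hierarchies in the following subsections operate on.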
2.1 Privacy and Threat Models
In this paper, we adopt a well-known metric called k-anonymity [2] to ensure the privacy of users.
Definition 1 (k-anonymous dataset): An anonymized trajectory dataset is a k-anonymization of an original trajectory dataset if, for every trajectory in the anonymized dataset, there are at least k-1 other trajectories with exactly the same set of points, and there is a one-to-one mapping between the trajectories in the anonymized and original datasets.
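The first condition of this definition can be checked mechanically. The sketch below is only an illustration under stated assumptions: the function name is_k_anonymous is hypothetical, trajectories are assumed to be sequences of hashable points, and the one-to-one mapping with the original dataset is not verified.

```python
from collections import Counter

def is_k_anonymous(anonymized, k):
    """Return True if every trajectory shares its exact point sequence
    with at least k-1 other trajectories in the anonymized dataset."""
    counts = Counter(tuple(trajectory) for trajectory in anonymized)
    return all(count >= k for count in counts.values())

# Hypothetical anonymized dataset: each point is a triplet of generalized
# (x, y, time) intervals, written here as (low, high) pairs.
group_a = (((0, 2), (4, 6), (0, 1)), ((2, 4), (4, 8), (1, 2)))
group_b = (((4, 8), (0, 4), (0, 2)),)
dataset = [group_a, group_a, group_b, group_b, group_b]

satisfies_2 = is_k_anonymous(dataset, 2)
satisfies_3 = is_k_anonymous(dataset, 3)
```

Here group_a appears only twice, so the dataset is 2-anonymous but not 3-anonymous.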
For the threat model, we assume that no uniquely identifiable information is released while publishing the dataset. However, the adversary may:
already know about part of the released trajectory for an individual and attempt to identify the rest of the trajectory.
already know the whole trajectory that an individual has travelled, but try to access other information released while publishing the dataset by identifying the user in the dataset.
Our aim is to protect the users against the adversary’s attempt to access sensitive information that may compromise the privacy of users.
2.2 Hierarchical Tree Transformation
To anonymize the dataset, generalization and suppression techniques are used based on a domain generalization hierarchy (DGH). A DGH for an attribute A is a partially ordered tree structure that maps between specific and generalized values of A. The root of the tree, returned by a dedicated function, is the most generalized value and contains zero bits of information.
Consider a map in which the x-coordinate attribute can have eight possible values. The DGH divides the largest possible interval for the x-coordinate, which is the root of the tree, into two, four, and eight x-coordinate intervals as the DGH increases in depth. In this example, the lowest level of the tree can be represented by three bits. One bit of information loss is incurred by moving up one level in the tree. Fig. 1 shows the structure of the x-coordinate DGH. A similar approach can be taken to generate the y-coordinate and time DGHs.
Each node on a DGH can be generalized by moving up one or multiple levels of the DGH, that is, by replacing it with one of its ancestor nodes. A special case of generalization, in which the node is generalized to the root of the DGH, is referred to as suppression. These two techniques are used as tools to anonymize the dataset in the following sections. It must be noted that although the quasi-identifiers in this paper are x-coordinate, y-coordinate, and the time of query, the algorithms developed in our work can be extended to include other attributes as well.
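Generalization and suppression on such a hierarchy can be sketched as below. The representation is an assumption made for illustration: a node is a half-open interval (low, high) over leaf indices 0 to 7, and the function names generalize and suppress are hypothetical. The interval convention used in Fig. 1 may differ.

```python
def generalize(node, levels=1, domain=8):
    """Move a node of a binary DGH up `levels` levels, stopping at the root.

    A node is a half-open interval (low, high) whose width is a power of two;
    `domain` is the assumed number of leaf values of the attribute.
    """
    low, high = node
    for _ in range(levels):
        width = high - low
        if width >= domain:  # already at the root (0, domain)
            break
        width *= 2
        low = (low // width) * width  # snap to the parent's left boundary
        high = low + width
    return (low, high)

def suppress(node, domain=8):
    """Suppression is generalization all the way to the root."""
    return (0, domain)

leaf = (5, 6)                   # the leaf holding x-coordinate value 5
parent = generalize(leaf)       # one level up
grandparent = generalize(leaf, 2)
root = suppress(leaf)
```

Each call to generalize with levels=1 doubles the interval width, which corresponds to giving up one bit of information about the coordinate.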
2.3 Loss Metric
In this paper, we quantify the loss using a metric similar to the one proposed in [8].
The information loss incurred during the generalization and suppression while replacing with in DGH is defined as
where function returns the number of leaves in the subtree generated by a node and function returns the loss incurred by generalization of the nodes.
Consider the DGH given in Fig. 1 the loss incurred while generalizing node to can be calculated as .
While generalizing two nodes, it is necessary to find the lowest common ancestor (LCA). The definition of LCA is given in Definition 3.
Definition 3 (LCA): The LCA of two nodes in a DGH is defined as the lowest node that is an ancestor of both nodes. A dedicated function returns the LCA.
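The loss and LCA computations can be illustrated with a short Python sketch. The exact loss formula is not reproduced in this text, so the form used here is an assumption: the loss of replacing a node by an ancestor is taken as the difference of the base-2 logarithms of their leaf counts, chosen only because it agrees with the statement in Section 2.2 that moving up one level costs one bit. Nodes are assumed to be half-open intervals over leaf indices, and all function names are hypothetical.

```python
from math import log2

def leaves(node):
    """Number of leaves in the subtree under an interval node (low, high)."""
    return node[1] - node[0]

def loss(node, ancestor):
    """Assumed loss, in bits, of generalizing `node` to `ancestor`."""
    return log2(leaves(ancestor)) - log2(leaves(node))

def lca(a, b):
    """Lowest common ancestor: the smallest power-of-two-aligned interval
    that contains both interval nodes a and b."""
    low, high = min(a[0], b[0]), max(a[1], b[1])
    width = 1
    while True:
        start = (low // width) * width
        if start + width >= high:
            return (start, start + width)
        width *= 2

def pair_loss(a, b):
    """Total assumed loss of generalizing both nodes to their LCA."""
    ancestor = lca(a, b)
    return loss(a, ancestor) + loss(b, ancestor)

near = pair_loss((0, 1), (1, 2))  # siblings: one level up each
far = pair_loss((3, 4), (4, 5))   # adjacent values in different halves of the tree
```

The second pair shows that numerically adjacent values can still be expensive to generalize together when they sit on opposite sides of a high-level split in the hierarchy.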
The total loss incurred by generalizing and in with their LCA can be calculated as
This value for generalization of trajectory to achieve the anonymized trajectory with respect to attribute can be calculated as
where indicates the -th location of the trajectory with respect to the attribute . Here, A could denote x-coordinate, y-coordinate, or time. Similarly, the total loss with respect to an attribute in an anonymized dataset can be computed as
The problem we seek to answer in this paper is formally presented in Problem 1 as follows.
Problem 1: Given a trajectory dataset, a privacy requirement k, and the quasi-identifiers x-coordinate, y-coordinate, and time, how can we generate an anonymized dataset that achieves the k-anonymity privacy metric and minimizes the total loss with respect to all the quasi-identifiers?
3 Proposed Approach
Our proposed anonymization framework consists of a robust alignment technique and a machine learning approach for clustering the trajectory datasets, both of which are presented in this section.
3.1 Alignment
The process of alignment is defined as finding the best match between two trajectories in order to minimize the overall cost of generalization and suppression. The process of alignment between two trajectories has been studied in different domains, where it is mostly referred to as sequence alignment (SA). In this paper, we incorporate a multiple SA technique called progressive SA [9].
3.1.1 Progressive Sequence Alignment
Progressive SA is a heuristic approach for multiple SA that is commonly used for aligning sets of protein sequences. As part of the algorithm, pairwise alignment of the trajectories is also required. We use dynamic SA for pairwise alignment of trajectories in progressive SA. Dynamic SA is based on dynamic programming and is commonly used in DNA SA [10, 11]. Fig. 2 illustrates how progressive SA works for four hypothetical sequences to generate the resulting aligned trajectory. The longest trajectory is chosen as the basis and is aligned with a randomly chosen trajectory. The pairwise alignment process is implemented using dynamic SA. Then, the resulting trajectory is aligned with a third trajectory. The process continues until all the trajectories are aligned. Instead of choosing the trajectories randomly during the progressive SA, the algorithm can choose the trajectory resulting in the lowest loss during the alignment. In Fig. 2, the way trajectory elements are located with respect to the longest path is referred to as the structure of the shorter path, and the spaces indicate the suppression operation during the alignment.
The dynamic SA algorithm is formally represented in Algorithm 1. Dynamic SA is based on dividing the problem of finding the best SA into subproblems and storing the solutions of the subproblems in a table or matrix. The objective is to achieve the minimal cost for SA. As before, the cost of alignment refers to the loss incurred during the alignment for the different attributes of the sequence, which are x-coordinate, y-coordinate, and the time of the query. A subproblem for aligning a prefix of the first trajectory with a prefix of the second can be reduced in one of three ways: 1) match the last elements of the two prefixes and find the optimal alignment of the remaining elements; 2) suppress the last element of the first prefix and find the optimal alignment of the rest with the second prefix; or 3) suppress the last element of the second prefix and find the optimal alignment of the first prefix with the rest.
The algorithm starts by creating a matrix whose dimensions are determined by the lengths of the two trajectories. The matrix is used to store the minimum cost of reaching each cell. Moreover, a list stores how each cell has been reached. Each cell can be reached from three preceding cells, and each path corresponds to one of the subproblems explained above. After filling in all the values of the matrix and tracing back through the list, the outputs of the algorithm are the value of the final cell, which is the minimum total loss required for the dynamic SA, the aligned trajectory, and the structure of the shorter path compared to the longer path.
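The pairwise dynamic SA and the progressive loop around it can be sketched as follows. This is a simplified illustration rather than a reproduction of Algorithm 1. It assumes that every attribute has a binary DGH with eight leaves, that a point is a tuple of half-open interval nodes (one per attribute), and that loss is measured in bits as depth below the root. The names align_pair and progressive_align are hypothetical, and the loss is not re-weighted when an already merged trajectory is aligned again.

```python
from math import log2

DEPTH = 3  # assumed: every attribute has 2**3 = 8 leaf values

def lca(a, b):
    """Smallest power-of-two-aligned interval containing intervals a and b."""
    low, high = min(a[0], b[0]), max(a[1], b[1])
    width = 1
    while True:
        start = (low // width) * width
        if start + width >= high:
            return (start, start + width)
        width *= 2

def bits(node):
    """Assumed loss of suppressing a node: its depth below the root, in bits."""
    return DEPTH - log2(node[1] - node[0])

def match_cost(p, q):
    """Loss of generalizing points p and q to their per-attribute LCAs."""
    return sum(bits(a) + bits(b) - 2 * bits(lca(a, b)) for a, b in zip(p, q))

def suppress_cost(p):
    return sum(bits(a) for a in p)

def align_pair(s, t):
    """Dynamic SA: return (minimum loss, aligned trajectory) for s and t."""
    n, m = len(s), len(t)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = cost[i - 1][0] + suppress_cost(s[i - 1]), "up"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = cost[0][j - 1] + suppress_cost(t[j - 1]), "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j], back[i][j] = min(
                (cost[i - 1][j - 1] + match_cost(s[i - 1], t[j - 1]), "diag"),
                (cost[i - 1][j] + suppress_cost(s[i - 1]), "up"),
                (cost[i][j - 1] + suppress_cost(t[j - 1]), "left"),
            )
    aligned, i, j = [], n, m
    while i > 0 or j > 0:  # trace back from the final cell
        move = back[i][j]
        if move == "diag":
            aligned.append(tuple(lca(a, b) for a, b in zip(s[i - 1], t[j - 1])))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1  # element of s is suppressed
        else:
            j -= 1  # element of t is suppressed
    aligned.reverse()
    return cost[n][m], aligned

def progressive_align(trajectories):
    """Start from the longest trajectory and fold in the others one by one."""
    ordered = sorted(trajectories, key=len, reverse=True)
    current = ordered[0]
    for other in ordered[1:]:
        _, current = align_pair(current, other)
    return current

# One-attribute toy trajectories (x-coordinate only).
s = [((0, 1),), ((5, 6),)]
t = [((1, 2),)]
u = [((3, 4),)]
pair_cost, pair_aligned = align_pair(s, t)
merged = progressive_align([t, s, u])
```

In this toy case the single point of t is matched with the first point of s and the second point of s is suppressed, because matching values 5 and 1 would require generalizing both to the root.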
3.2 Clustering
Clustering can be seen as a search for hidden patterns that may exist in datasets. In simple words, it refers to grouping data entries into disjoint clusters so that the members of each cluster are very similar to each other.
3.2.1 k'-means Clustering Approach
The k'-means algorithm [12] is an attractive clustering algorithm currently used in many applications, particularly in data analysis and pattern recognition [13]. The main advantage of the k'-means algorithm is its simplicity and fast execution time. The prime notation on the variable k' avoids any confusion between the “k” of the clustering algorithm and the k used in the definition of k-anonymity addressed before.
The algorithm aims to partition the input dataset into k' clusters. The only inputs to the algorithm are the number of clusters and the dataset. Clusters are represented by adaptively changing cluster centres. The initial values of the cluster centres are chosen randomly. In each stage, the algorithm computes the squared distance of the data points from the centroids and partitions them based on the nearest centroid to each data point. The algorithm continues the same process until the values of the centroids no longer change. The k'-means algorithm is guaranteed to converge [14].
In the rest of this section, we explain how the k'-means algorithm can be applied to trajectory datasets to increase the privacy of users while publishing the data.
The total loss incurred by generalizing and with respect to can be calculated as
Lemma 1 indicates that the loss incurred by generalizing two nodes is equal to the difference of losses incurred by their suppression. As before, for any clustering outcome of data, assume that is a two-dimensional list in which -th element of the list returns the IDs of the trajectories in the -th cluster. Moreover, we denote the -th cluster head after generalization and suppression for all the trajectories by . Therefore, the total loss can be written as
Rearranging equation (9), the objective equation can be found by minimizing equation (3.2.1). This can be done by maximizing part B and minimizing part A. Since the cluster heads are generated based on the clustering algorithm, they cannot be used as part of the optimization process. Therefore, we aim at minimizing part A of the equation (3.2.1).
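The split into part A and part B can be made concrete under one assumption. The notation below is introduced only for illustration and is not taken from the original equations: \ell(n) denotes the loss of suppressing node n (its distance from the DGH root, summed over the points and attributes of a trajectory where applicable), C_j is the j-th cluster, h_j its cluster head, and t_i the i-th trajectory. The assumption is that the loss of replacing a node by an ancestor equals the difference of their distances from the root.

```latex
% Assumed form of the loss: a difference of distances from the root,
% with \ell(n) = L(n, \mathrm{root}) and \ell(\mathrm{root}) = 0.
L(n, n') = \ell(n) - \ell(n')

% Two nodes generalized to their lowest common ancestor a = \mathrm{LCA}(n_1, n_2):
L(n_1, a) + L(n_2, a) = \ell(n_1) + \ell(n_2) - 2\,\ell(a)

% Summed over all clusters C_j with cluster heads h_j:
\sum_{j} \sum_{i \in C_j} L(t_i, h_j)
  = \underbrace{\sum_{i} \ell(t_i)}_{\text{part A}}
  - \underbrace{\sum_{j} |C_j|\,\ell(h_j)}_{\text{part B}}
```

Under this reading, part A depends only on the original trajectories, while part B depends on the cluster heads produced by alignment, which is consistent with the statement that the cluster heads cannot be used directly in the optimization.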
Part A in the equation (3.2.1) refers to finding the total distance of each trajectory from the DGH root of the attributes. Therefore, for each trajectory a three-dimensional vector is constructed whose components store the loss incurred by generalizing the x-coordinate, y-coordinate, and time, respectively. Having the distance of all the points from the roots, we cluster the trajectories using the k'-means algorithm. The algorithm clusters trajectories with a similar loss from the root in the same group. This process is particularly important as trajectory datasets usually range from trajectories as short as one query to trajectories with hundreds of queries.
The k'-means algorithm clusters the trajectories without any constraint on the minimum number of trajectories that need to be in each cluster. For more sensitive applications, a heuristic approach can be applied on top of the k'-means algorithm to make sure that all the clusters include at least k trajectories.
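The construction of the three-dimensional vectors and the clustering step can be sketched in plain Python. Several things here are assumptions made for illustration: each raw point is taken to be a leaf whose suppression costs a fixed number of bits per attribute (the hypothetical depths tuple), the function names loss_vector and kmeans are not from the paper, and the k'-means loop is a basic textbook version with a seeded random initialization.

```python
import random

def loss_vector(trajectory, depths=(3, 3, 3)):
    """Loss of suppressing every point of a raw trajectory, per attribute
    (x-coordinate, y-coordinate, time), assuming each point is a DGH leaf."""
    return tuple(len(trajectory) * depth for depth in depths)

def kmeans(vectors, k_prime, seed=0, max_iter=100):
    """Basic k'-means: returns a list of clusters, each a list of indices."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k_prime)  # random initial centres
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k_prime)]
        for index, vector in enumerate(vectors):
            distances = [sum((a - b) ** 2 for a, b in zip(vector, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(index)
        updated = [
            tuple(sum(vectors[i][d] for i in members) / len(members)
                  for d in range(len(vectors[0])))
            if members else centroids[c]  # keep the centre of an empty cluster
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:  # centroids no longer change
            break
        centroids = updated
    return clusters

# Hypothetical dataset: trajectories with 1, 2, 3, 50, 55 and 60 queries.
trajectories = [[(0, 0, 0)] * n for n in (1, 2, 3, 50, 55, 60)]
vectors = [loss_vector(trajectory) for trajectory in trajectories]
clusters = kmeans(vectors, 2)
```

Under the leaf assumption the vector grows with the number of queries, so short and long trajectories end up in different clusters, which is the behaviour described above.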
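The text does not specify which heuristic is intended, so the following is only one hypothetical possibility: repeatedly merge the smallest undersized cluster into the cluster with the nearest centroid. The function name enforce_min_size and the merge rule are assumptions, and merging trades extra generalization loss for the size guarantee.

```python
def enforce_min_size(clusters, vectors, k):
    """Merge clusters with fewer than k members into their nearest neighbour.

    clusters: list of lists of indices into `vectors`.
    If the dataset holds fewer than k items, a single undersized cluster remains.
    """
    clusters = [list(members) for members in clusters if members]

    def centroid(members):
        dims = len(vectors[0])
        return [sum(vectors[i][d] for i in members) / len(members) for d in range(dims)]

    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    while len(clusters) > 1:
        smallest = min(clusters, key=len)
        if len(smallest) >= k:
            break  # every cluster already has at least k members
        clusters.remove(smallest)
        centre = centroid(smallest)
        target = min(clusters, key=lambda members: squared_distance(centroid(members), centre))
        target.extend(smallest)
    return clusters

# Hypothetical one-dimensional loss vectors and an initial clustering.
example_vectors = [(1,), (2,), (3,), (50,), (52,), (90,)]
initial = [[0, 1, 2], [3, 4], [5]]
at_least_2 = enforce_min_size(initial, example_vectors, 2)
at_least_4 = enforce_min_size(initial, example_vectors, 4)
```

With k = 2 the singleton is absorbed by its nearest cluster; with k = 4 the remaining clusters are merged as well, leaving one cluster of six.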
4 Experiments
In our experiments, we use the data collected by the Geolife project [15]. The Geolife dataset includes the GPS trajectories of over 180 users from April 2007 to August 2012 in Beijing, China. The dataset entails trajectories with a total distance of . Each entry of data is represented by coordinates and the timestamp at which the query happened. We have conducted our experiments on the central part of the Beijing map with the resolution of for each grid cell. The location privacy requirement (k) of the users is investigated for the values , , and . The experiments are performed on a PC with a GHz core-i7 Intel processor, -bit Windows operating system, and GB of RAM. The algorithms are implemented in Python.
It must be noted that the dataset includes trajectories as large as hundreds of queries and as small as a single query from the location-based service provider. Therefore, matching trajectories of such varying lengths imposes a large loss even for the best possible match of the sequences. The incurred loss of the proposed framework is shown in Fig. 3. The x-axis indicates the privacy requirement for the dataset (k) and the y-axis indicates the total loss incurred, which includes the loss while applying generalization and suppression on x-coordinate, y-coordinate, and the time of the query. Furthermore, the maximum possible incurred loss, which corresponds to suppressing all the trajectories, is shown by a green dashed line on the graphs. As expected, increasing the value of k results in a higher incurred loss due to larger cluster sets, which means that a higher number of trajectories must be aligned in each cluster.
Due to the lack of any constraint on the k'-means algorithm, some of the users may experience a privacy level lower than k-anonymity, as is the case in mobile networks in which some of the users may experience a lower quality of service. Fig. 4 indicates the average value of k achieved while applying the k'-means algorithm. It can be seen in the figure that on average individuals achieve a privacy level higher than k when the proposed approach is applied. The average improves further as the value of k increases.
Fig. 5 shows the comparison between our proposed anonymization technique and the recent generalization method proposed in . The authors of that method attempt to minimize the incurred loss during the anonymization by sorting the spatiotemporal locations in the time domain and applying a heuristic approach. Note that the aim of any anonymization approach is to maximize utility while preserving the privacy of users. Utility in generalization techniques refers to the area released for locations in the dataset. Therefore, to have a fair comparison, we compare our work with that approach based on the average released area for locations. It can be seen from the figure that our proposed algorithm can significantly increase the utility of the generalization approach. In other words, the anonymized dataset has on average a smaller released area per location while preserving the privacy of users.
5 Conclusion
In this paper, we have developed a method to preserve the privacy of users while publishing spatiotemporal trajectories. The proposed approach incorporates multiple sequence alignment for anonymization, in addition to developing a technique that enables the use of machine learning methods for clustering in the context of privacy. We implemented the clustering based on the k'-means clustering algorithm and applied it to the Geolife dataset.
References
[1] A. Government, “New Australian government data sharing and release legislation,” 2018.
[2] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
[3] A. Tamersoy, G. Loukides, M. E. Nergiz, Y. Saygin, and B. Malin, “Anonymization of longitudinal electronic medical records,” IEEE Transactions on Information Technology in Biomedicine, vol. 16, no. 3, pp. 413–423, 2012.
[4] F. Xu, Z. Tu, Y. Li, P. Zhang, X. Fu, and D. Jin, “Trajectory recovery from ash: User privacy is not preserved in aggregated mobility data,” in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1241–1250.
[5] Y. Dong and D. Pi, “Novel privacy-preserving algorithm based on frequent path for trajectory data publishing,” Knowledge-Based Systems, vol. 148, pp. 55–65, 2018.
[6] M. Gramaglia, M. Fiore, A. Tarable, and A. Banchs, “Towards privacy-preserving publishing of spatiotemporal trajectory data,” arXiv preprint arXiv:1701.02243, 2017.
[7] M. Terrovitis, G. Poulis, N. Mamoulis, and S. Skiadopoulos, “Local suppression and splitting techniques for privacy preserving publication of trajectories,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 7, pp. 1466–1479, 2017.
[8] V. S. Iyengar, “Transforming data to satisfy privacy constraints,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002, pp. 279–288.
[9] B. Chowdhury and G. Garai, “A review on multiple sequence alignment from the perspective of genetic algorithm,” Genomics, 2017.
[10] X. Chen, C. Wang, S. Tang, C. Yu, and Q. Zou, “CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment,” BMC Bioinformatics, vol. 18, no. 1, p. 315, 2017.
[11] Q. Le, F. Sievers, and D. G. Higgins, “Protein multiple sequence alignment benchmarking through secondary structure prediction,” Bioinformatics, vol. 33, no. 9, pp. 1331–1337, 2017.
[12] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14. Oakland, CA, USA, 1967, pp. 281–297.
[13] S. K. Pal and P. P. Wang, Genetic Algorithms for Pattern Recognition. CRC Press, 2017.
[14] A. Fischer and D. Picard, “Convergence rates for smooth k-means change-point detection,” arXiv preprint arXiv:1802.07617, 2018.
[15] Y. Zheng, X. Xie, and W.-Y. Ma, “Geolife: A collaborative social networking service among user, location and trajectory,” IEEE Data Eng. Bull., vol. 33, no. 2, pp. 32–39, 2010.