Generative Models for Simulating Mobility Trajectories
Abstract
Mobility datasets are fundamental for evaluating algorithms pertaining to geographic information systems and for facilitating experimental reproducibility. However, privacy implications restrict the sharing of such datasets, as even aggregated location data is vulnerable to membership inference attacks. Current synthetic mobility dataset generators attempt to superficially match a priori modeled mobility characteristics, which do not accurately reflect real-world characteristics. Modeling human mobility to generate synthetic yet semantically and statistically realistic trajectories is therefore crucial for publishing trajectory datasets with a satisfactory level of utility while preserving user privacy. Specifically, long-range dependencies inherent to human mobility are challenging to capture with both discriminative and generative models. In this paper, we benchmark the performance of recurrent neural architectures (RNNs), generative adversarial networks (GANs) and non-parametric copulas in generating synthetic mobility traces. We evaluate the generated trajectories with respect to their geographic and semantic similarity, circadian rhythms, long-range dependencies, and training and generation time. We also include two-sample tests to assess the statistical similarity between the observed and simulated distributions, and we analyze the privacy trade-offs with respect to membership inference and location-sequence attacks.
Vaibhav Kulkarni, Department of Information Systems, UNIL-HEC Lausanne, vaibhav.kulkarni@unil.ch
Natasa Tagasovska, Department of Information Systems, UNIL-HEC Lausanne, natasa.tagasovska@unil.ch
Thibault Vatter, Department of Statistics, Columbia University, tv2233@columbia.edu
Benoit Garbinato, Department of Information Systems, UNIL-HEC Lausanne, benoit.garbinato@unil.ch
Workshop on Modeling and Decision-Making in the Spatiotemporal Domain, 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
1 Introduction
The pervasiveness of mobile devices equipped with internet connectivity and global-positioning functionality has resulted in an increasingly large amount of location data on individuals. This data is beneficial for addressing and validating spatiotemporal data-based problems, including predictive and kNN queries, object tracking, mobility modeling and location privacy. Due to the sensitive nature of datasets containing mobility traces, sharing them with untrusted entities presents privacy implications. Trivial heuristics can be applied to such datasets to derive personally identifiable information about individuals, even at aggregate levels [35].
Publicly accessible mobility datasets [39, 24, 14] are usually not adequate for large-scale experimental evaluations, which compromises scalability tests. This issue incentivizes synthetic mobility trajectory generators that simulate the behavior of moving objects, as required for comprehensive performance evaluations. In this context, one typically considers rigid and unnatural mobility models, which guarantee neither the existence nor the cardinality of patterns within the synthetic population. Alternative approaches rely on parametric sequential models [16] and Markov processes [3] to learn and generate trajectories. Such techniques also ignore the presence of long-range dependencies [20] inherent to human mobility, which features a non-Markovian character [38, 17].
It is therefore imperative to generate context-dependent synthetic traces resembling human mobility behavior at satisfactory utility levels while preserving user privacy. However, one of the major challenges is the absence of quantitative methods for evaluating the realistic nature of synthetic traces and the associated utility-privacy trade-off.
To this end, we present several non-parametric approaches to generate large-scale synthetic trajectories by training the models on a real-world dataset and then hallucinating trajectories using the trained model. We perform an extensive evaluation of the generated trajectories by assessing their geographic and semantic similarity to the actual dataset. We use two-sample metrics to measure the statistical similarity between datasets. We then quantify the presence of long-range dependencies by computing the mutual-information decay, and we conduct privacy-leakage tests on the generated trajectories. We conclude with a discussion of appropriate strategies and applicable evaluation metrics based on our experimental results, and we outline open questions and challenges.
2 Related Work
Table 1 summarizes existing trajectory generators, which formulate synthetic trajectory simulation as an optimization problem solved by genetic algorithms under the constraint of a priori determined parameters. A fundamental issue is the selection and definition of the parameter space that controls the evolution of the moving objects. The stringent and classified network connections thus influence the realistic nature of the generated trajectories. In several cases, there is no correlation between the future direction of movement and the past locations. Repeated visits to a given location within a short span of time are also observed due to the bounding parameters. Therefore, the symbolic nature of these frameworks results in an implicit location-dependent context, which compromises the realistic nature of the generated activity patterns. To address these drawbacks of parametric modeling, Ouyang et al. [28] propose a GAN-based approach to generate trajectories, where the discriminator is based on a convolutional neural network (CNN) [19]. Similarly, we explore other deep learning architectures based on RNNs, which are known to model sequential data better than CNNs [31]. We also investigate generative models based on the non-parametric copulas of [7].
Table 1: Summary of existing trajectory generators.

| Technique | Model name | Parameters considered |
| --- | --- | --- |
| Free movement | GSTD [34] | statistical distributions (mean, skew, standard deviation) |
| Free movement | G-TERD [26] | speed, rotation angle, direction |
| Free movement | Oporto [8] | start time, end time, velocity, orientation |
| Road networks | Brinkhoff [4] | speed, street capacity, nearby object location, shortest path |
| Road networks | SUMO [2] | road length, headway time, lane change times |
| Road networks | BerlinMOD [5] | road network, trip start and end, Brinkhoff model |
| Road networks | ST-ACTS [9] | Geo-dependency model |
| Road networks | Hermoupolis [29] | mobility pattern, road network, points of interest |
| Multi environments | MWGen [36] | trip plan, road network, floor plan, routing graph |
| Multi environments | MNTG [23] | movement model, moving objects, simulation time |
|  | Markov models [3] | semantic locations, geographic |
|  | Semi-Markov models [1] | stay points, transition paths |
3 Synthesizing Trajectories using Generative Modeling
First we explore the benefits of applying deep learning architectures to synthesize mobility trajectories. RNNs use their hidden memory representation to process input sequences, and we select four architectures: (1) Char-RNN (SRNN) [11], (2) RNN-LSTM [12], (3) recurrent highway networks (RHN) [40], and (4) the pointer sentinel mixture model (PSMM) [22]. For GANs, where two neural networks compete in a zero-sum game framework, we select two architectures: (1) SGAN [37], and (2) RGAN [6]. These architectures differ in their capacity to manipulate their internal memory representation and propagate gradients along the network.

In addition to neural-network-based solutions, we also evaluate copulas, a seldom explored generative model in the machine learning community. Given a bivariate random vector $(X_1, X_2)$, Sklar's theorem [33] states that the joint density $f$ (the theorem is usually stated for the distribution rather than the density, and for random vectors of arbitrary dimension) is $f(x_1, x_2) = f_1(x_1)\, f_2(x_2)\, c(F_1(x_1), F_2(x_2))$, where $f_1$ and $f_2$ are the marginal densities, $F_1$ and $F_2$ the marginal distributions, and $c$ the copula density. In other words, the bivariate density can be uniquely described by the product of its marginal densities and a copula density representing its dependence structure. A useful consequence of this representation is that, by taking the logarithm on both sides, estimation of the joint density can be performed in two steps: the marginal distributions first, and the copula afterwards. In a nutshell, copulas allow one to flexibly specify the marginal and joint behavior of random variables.
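To make the two-step estimation explicit, the decomposition can be written in log form. The notation is an assumption made for this illustration: $f_1, f_2$ denote the marginal densities, $F_1, F_2$ the marginal distributions, $c$ the copula density, and $(x_{i1}, x_{i2})$, $i = 1, \dots, n$, an observed sample.

```latex
\begin{align}
\log f(x_1, x_2) &= \log f_1(x_1) + \log f_2(x_2)
                   + \log c\bigl(F_1(x_1), F_2(x_2)\bigr), \\
\sum_{i=1}^{n} \log f(x_{i1}, x_{i2})
  &= \underbrace{\sum_{i=1}^{n} \bigl[\log f_1(x_{i1}) + \log f_2(x_{i2})\bigr]}_{\text{step 1: margins only}}
   + \underbrace{\sum_{i=1}^{n} \log c\bigl(F_1(x_{i1}), F_2(x_{i2})\bigr)}_{\text{step 2: copula only}}.
\end{align}
```

In step 1 the margins are estimated separately, giving $\hat{F}_1$ and $\hat{F}_2$. In step 2 the copula is estimated from the pseudo-observations $\hat{u}_{ij} = \hat{F}_j(x_{ij})$, with the margins held fixed.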
An important aspect in the generative context is that, because $F(X) \sim U[0, 1]$ for any continuous random variable $X$ with distribution $F$, the copula is a distribution with uniform margins. Hence, from a copula sample $(U_1, U_2)$, one obtains a sample on the original scale using the inverse cumulative distributions via $(X_1, X_2) = (F_1^{-1}(U_1), F_2^{-1}(U_2))$. For further details on copulas, we refer the reader to [13]. In this paper, we combine the kernel-based non-parametric copulas of [7] with the empirical distribution function of the margins to obtain highly flexible models.
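The sampling recipe above (draw from the copula, then invert the margins) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the experiments: a Gaussian copula is swapped in for the kernel-based non-parametric copula of [7], the margins are handled through empirical ranks and empirical quantiles, and all function names and the toy data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()
_phi = np.vectorize(_STD_NORMAL.cdf)          # standard normal CDF
_phi_inv = np.vectorize(_STD_NORMAL.inv_cdf)  # standard normal quantile function


def pseudo_observations(x):
    """Map each column to (0, 1) through its empirical distribution function."""
    n = x.shape[0]
    ranks = x.argsort(axis=0).argsort(axis=0) + 1  # ranks 1..n per column
    return ranks / (n + 1.0)


def fit_gaussian_copula(x):
    """Correlation of the normal scores (assumed stand-in for a kernel copula fit)."""
    z = _phi_inv(pseudo_observations(x))
    return np.corrcoef(z, rowvar=False)


def sample_copula_model(x, corr, n_samples, rng):
    """Draw copula samples U, then map back with X_j = F_j^{-1}(U_j)."""
    d = x.shape[1]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = _phi(z)  # each column of u has (approximately) uniform margins
    # The empirical quantile function of each margin plays the role of F_j^{-1}.
    return np.column_stack([np.quantile(x[:, j], u[:, j]) for j in range(d)])


rng = np.random.default_rng(0)
# Hypothetical toy data: two dependent variables with non-Gaussian margins.
latent = rng.normal(size=2000)
observed = np.column_stack([
    np.exp(latent + 0.3 * rng.normal(size=2000)),
    (latent + 0.3 * rng.normal(size=2000)) ** 3,
])
corr = fit_gaussian_copula(observed)
synthetic = sample_copula_model(observed, corr, n_samples=2000, rng=rng)
```

Because the margins are inverted empirically, the synthetic values stay within the range of the observed data; the dependence structure is only as flexible as the copula family chosen, which is the motivation for non-parametric copulas.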
Data representation. We are given a dataset of mobility trajectories, where a trajectory $T_u$ of an individual $u$ is a temporally ordered sequence of tuples $T_u = \langle (l_1, t_1), (l_2, t_2), \dots, (l_n, t_n) \rangle$, where $l_i = (lat_i, lon_i)$ is the latitude-longitude coordinate pair and $t_i$ is the timestamp, such that $t_{i+1} > t_i$. We first transform the location data onto a uniform grid for dimensionality reduction using a technique that preserves spatial locality (Google S2: https://s2geometry.io/), thus translating $T_u$ into a 2D trace $T'_u = \langle (g_1, t_1), (g_2, t_2), \dots, (g_n, t_n) \rangle$, where $g_i$ is the geohash of the projected cell ID and $t_i$ the timestamp.
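The mapping from coordinates to grid cells can be illustrated with a small sketch. It is an assumption-laden stand-in: instead of Google S2 (which uses a Hilbert curve on a cube projection), it quantizes latitude and longitude on a plain equirectangular grid and interleaves the bits into a Z-order (Morton) code, which keeps many, though not all, nearby cells close in ID space. The function names, the grid level, and the sample coordinates are hypothetical.

```python
def cell_id(lat, lon, level=16):
    """Quantize (lat, lon) on a 2^level x 2^level grid and interleave the bits."""
    n = 1 << level
    # Normalise to [0, 1) and clamp so that lat = 90 or lon = 180 stay in range.
    row = min(int((lat + 90.0) / 180.0 * n), n - 1)
    col = min(int((lon + 180.0) / 360.0 * n), n - 1)
    code = 0
    for bit in range(level):
        code |= ((row >> bit) & 1) << (2 * bit + 1)  # odd bits: latitude
        code |= ((col >> bit) & 1) << (2 * bit)      # even bits: longitude
    return code


def to_cell_trace(trajectory, level=16):
    """Translate [((lat, lon), t), ...] into [(cell_id, t), ...], keeping order."""
    return [(cell_id(lat, lon, level), t) for (lat, lon), t in trajectory]


# Hypothetical trajectory: two nearby fixes followed by a distant one.
trajectory = [
    ((46.5197, 6.6323), 0),
    ((46.5198, 6.6324), 60),
    ((46.2044, 6.1432), 3600),
]
cell_trace = to_cell_trace(trajectory)
```

At level 16 a cell spans roughly 0.003 degrees of latitude, so the first two fixes fall into the same cell while the third does not; a coarser or finer level trades vocabulary size against spatial resolution.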
4 Experiments, Results and Discussion
Experimental setup. A complete trajectory sequence can be generated by iteratively feeding the current output trajectory sequence to the trained model as input for the next step. RNNs are trained on the geohashes and timestamps of all the individuals present in the dataset in a deterministic framework. GANs are first trained to model and then successively reproduce the traces in the same representation, which is mapped back to the coordinates. We use the standard implementations of the predictive algorithms and hyperparameters as described in their respective papers. To use copulas as generative models, we rely on the rvinecopulib package [25], whose vine routine implements the automatic kernel-based fitting of the dependence structure.
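The iterative generation loop described at the start of this setup can be sketched as follows. The count-based `fit_next_step_model` is a hypothetical placeholder for a trained RNN or GAN: it conditions only on the last cell, so unlike the benchmarked models it cannot capture long-range dependencies. The function names, cell labels, and seed are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def fit_next_step_model(sequences):
    """Count observed transitions; a toy stand-in for a trained sequence model."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(model, seed_cell, length, rng):
    """Feed the latest output back as the next input until `length` is reached."""
    trace = [seed_cell]
    while len(trace) < length:
        options = model.get(trace[-1])
        if not options:
            break  # no continuation was observed for this cell
        cells, weights = zip(*options.items())
        trace.append(rng.choices(cells, weights=weights, k=1)[0])
    return trace


# Hypothetical training traces over symbolic cell labels.
training = [["home", "work", "gym", "home"], ["home", "work", "home"]]
model = fit_next_step_model(training)
trace = generate(model, "home", length=10, rng=random.Random(0))
```

Replacing `model.get(trace[-1])` with a call that takes the whole prefix `trace` is what distinguishes a recurrent generator from this first-order placeholder.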
Dataset. Experiments are performed using the Nokia mobile dataset [18], which consists of mobility trajectories of individuals collected in Switzerland. We use a total of 70M data points to train the considered models.
Evaluation. We evaluate the generated trajectories on this dataset along four distinct dimensions: (1) geographic and semantic similarity, (2) statistical similarity, (3) long-range dependencies, and (4) privacy tests. To assess the geographic and semantic similarity, we compare the probability distribution of visiting locations (visit time and dwell time) in the trajectories generated by each technique with that of the true dataset (see Figure 1). Char-RNN, RGAN and copulas have the closest fit to the true distribution, indicating that the locations are very well preserved in the respective synthetic datasets.
To evaluate the statistical similarity, we use the Maximum Mean Discrepancy (MMD) [10] to test whether one can reject the null hypothesis that a synthetic sample has the same distribution as the data. The scatter plots are shown in Figure 2. MMD works by replacing the probability densities with embeddings that facilitate the computation of distances between distributions. Note that defining distance metrics for time series data such as mobility trajectories is challenging due to alignment concerns [6]. We thus use the time axis for alignment, as done by Esteban et al. [6]. The results, along with the training and generation time for each approach, are shown in Table 2, where we observe that all approaches achieve similar results in terms of MMD, with copulas standing out with a lower value. We can thus infer that copulas can synthesize distributions with statistical characteristics closer to the observed ones. Regarding computational efficiency, copulas require a fraction of the time needed by NN-based approaches.
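For readers unfamiliar with MMD, the following sketch computes the standard unbiased estimate of the squared MMD between two samples with a Gaussian (RBF) kernel. It is illustrative only: the kernel choice, the bandwidth, and the toy data are assumptions, and each row is assumed to be a fixed-length feature vector (for example, a trajectory aligned on the time axis).

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 h^2))."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD; can be slightly negative by chance."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    # Within-sample terms exclude the diagonal (i == j) entries.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2))              # hypothetical "observed" sample
same = rng.normal(size=(500, 2))              # drawn from the same distribution
shifted = rng.normal(loc=1.0, size=(500, 2))  # drawn from a shifted distribution

mmd_same = mmd2_unbiased(real, same)
mmd_shifted = mmd2_unbiased(real, shifted)
```

A value close to zero is consistent with the null hypothesis of equal distributions, while a clearly positive value indicates a discrepancy; in practice the bandwidth is often tuned or set by a heuristic such as the median pairwise distance.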
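Table 2: MMD and CPU time (training and generation) for each approach.

| Metric / Method | Char-RNN | RNN-LSTM | RHN | PSMM | SGAN | RGAN | Copula |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMD | 0.32 (1e-3) | 0.27 (9e-4) | 0.30 (1e-3) | 0.21 (6e-4) | 0.19 (7e-4) | 0.21 (6e-4) | 0.01 (6e-4) |
| CPU time (sec) | 9k + 10 | 10.3k + 14 | 12.7k + 15 | 10.5k + 15 | 11.2k + 15 | 11.5k + 14 | 6.5 + 0.76 |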
Figure 3(a) shows the result of the long-range dependency test in terms of mutual information decay [21, 20]. We observe a power-law decay in the case of GANs, copulas and RNN-LSTM, indicating that they account for the long-range dependencies in mobility trajectories. Figure 3(b) shows the results of two privacy tests: (1) a location-sequence attack, and (2) a membership inference attack. Given a synthetic dataset, (1) measures the accuracy with which trajectories in the dataset can be reconstructed [32], and (2) measures an adversary's accuracy in inferring whether a target individual contributed to a specific trajectory [30]. For these tests, we use the Location-Privacy and Mobility Meter [32], where obfuscation is performed using the location hiding mechanism. Given a completely random distribution, the accuracy of recovered user information is 0; we therefore suspect that the privacy-based score is biased towards representations that do not accurately capture the statistical properties of the true dataset.
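The long-range dependency test reported in Figure 3(a) relies on the mutual information between symbols separated by a given lag. A minimal sketch of such a computation is shown below, using a plug-in (frequency-count) estimator; the estimator choice, the function names, and the toy sequences are assumptions, and the plug-in estimate is known to be biased upward for short sequences or large vocabularies.

```python
import math
import random
from collections import Counter


def mutual_information_at_lag(seq, lag):
    """Plug-in estimate (in bits) of I(X_t; X_{t+lag}) for a symbolic sequence."""
    pairs = list(zip(seq, seq[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) rewritten in terms of raw counts.
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi


def mutual_information_decay(seq, max_lag):
    """Mutual information for lags 1..max_lag; plot on log-log axes to inspect decay."""
    return [mutual_information_at_lag(seq, d) for d in range(1, max_lag + 1)]


# Hypothetical sequences: a strictly periodic trace and an i.i.d. random one.
periodic = ["A", "B"] * 500
rng = random.Random(0)
shuffled = [rng.choice("ABCD") for _ in range(5000)]

decay_periodic = mutual_information_decay(periodic, max_lag=4)
decay_shuffled = mutual_information_decay(shuffled, max_lag=4)
```

For the periodic trace the mutual information stays near one bit at every lag, whereas for the i.i.d. trace it is near zero; a roughly straight line on log-log axes for a real or synthetic trace would be consistent with power-law decay.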
5 Conclusion and Future Work
In this work, we propose and evaluate a variety of generative models to synthesize mobility trajectories. To the best of our knowledge, this is the first study to do so using seven different approaches while evaluating their realism across four dimensions. From the results and discussion, we observe that, regarding statistical and semantic properties, copulas have an advantage over all other methods. Additionally, all NN-based methods are time consuming, which makes copulas favorable when computational efficiency is important to the end users. As future work, we will consider datasets collected in bigger cities and generate larger synthetic datasets to evaluate the performance of these models under high movement stochasticity. To curb the privacy leakage of the true dataset while maintaining utility, trajectory generation can be designed as an optimization problem with the objective of jointly maximizing statistical similarity and privacy. However, it is still unclear how to define an adaptive, configurable metric for assessing such a property; this is part of ongoing work [27]. While this paper represents an initial comparative study of various generative models, a deeper understanding of their performance will be needed to compute utility-privacy scores, as applied to online services by Krause and Horvitz [15], before publicly releasing the synthetic datasets. Another interesting avenue for research is to apply transfer learning in order to map a mobility behavioral model captured in one city onto another region.
References
[1] Mitra Baratchi, Nirvana Meratnia, Paul JM Havinga, Andrew K Skidmore, and Bert AKG Toxopeus. A hierarchical hidden semi-Markov model for modeling mobility data. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 401–412. ACM, 2014.
[2] Michael Behrisch, Laura Bieker, Jakob Erdmann, and Daniel Krajzewicz. SUMO – simulation of urban mobility. In The Third International Conference on Advances in System Simulation (SIMUL 2011), Barcelona, Spain, volume 42, 2011.
[3] Vincent Bindschaedler and Reza Shokri. Synthesizing plausible privacy-preserving location traces. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 546–563. IEEE, 2016.
[4] Thomas Brinkhoff. A framework for generating network-based moving objects. GeoInformatica, 6(2):153–180, 2002.
[5] Christian Düntgen, Thomas Behr, and Ralf Hartmut Güting. BerlinMOD: a benchmark for moving object databases. The VLDB Journal (The International Journal on Very Large Data Bases), 18(6):1335–1368, 2009.
[6] Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633, 2017.
[7] Gery Geenens, Arthur Charpentier, and Davy Paindaveine. Probit transformation for nonparametric kernel estimation of the copula density. Bernoulli, 23(3):1848–1873, 2017. ISSN 1350-7265.
[8] Fosca Giannotti, Andrea Mazzoni, Simone Puntoni, and Chiara Renso. Synthetic generation of cellular network positioning data. In Proceedings of the 13th annual ACM international workshop on Geographic information systems, pages 12–20. ACM, 2005.
[9] Gyozo Gidofalvi and Torben Bach Pedersen. ST-ACTS: a spatio-temporal activity simulator. In Proceedings of the 14th annual ACM international symposium on Advances in geographic information systems, pages 155–162. ACM, 2006.
[10] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[11] Stephen Grossberg. Recurrent neural networks. Scholarpedia, 8(2):1888, 2013.
[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[13] Harry Joe. Dependence Modeling with Copulas. Chapman & Hall/CRC, 2014.
[14] Niko Kiukkonen, Jan Blom, Olivier Dousse, Daniel Gatica-Perez, and Juha Laurila. Towards rich mobile phone datasets: Lausanne data collection campaign. Proc. ICPS, Berlin, 2010.
[15] Andreas Krause and Eric Horvitz. A utility-theoretic approach to privacy and personalization. In AAAI, volume 8, pages 1181–1188, 2008.
[16] Vaibhav Kulkarni and Benoît Garbinato. Generating synthetic mobility traffic using RNNs. In Proceedings of the 1st Workshop on Artificial Intelligence and Deep Learning for Geographic Knowledge Discovery, GeoAI '17, pages 1–4, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5498-1. doi: 10.1145/3149808.3149809. URL http://doi.acm.org/10.1145/3149808.3149809.
[17] Vaibhav Kulkarni, Abhijit Mahalunkar, Benoit Garbinato, and John D Kelleher. On the inability of Markov models to capture criticality in human mobility. arXiv preprint arXiv:1807.11386, 2018.
[18] Juha K Laurila, Daniel Gatica-Perez, Imad Aad, Olivier Bornet, Trinh-Minh-Tri Do, Olivier Dousse, Julien Eberle, Markus Miettinen, et al. The mobile data challenge: Big data for mobile computing research. In Pervasive Computing, number EPFL-CONF-192489, 2012.
[19] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[20] Henry W Lin and Max Tegmark. Critical behavior from deep dynamics: a hidden dimension in natural language. arXiv preprint arXiv:1606.06737, 2016.
[21] Abhijit Mahalunkar and John D Kelleher. Understanding recurrent neural architectures by analyzing and synthesizing long distance dependencies in benchmark sequential datasets. arXiv preprint arXiv:1810.02966, 2018.
[22] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016.
[23] Mohamed F Mokbel and Walid G Aref. SOLE: scalable online execution of continuous queries on spatio-temporal data streams. The VLDB Journal (The International Journal on Very Large Data Bases), 17(5):971–995, 2008.
[24] Sonia Ben Mokhtar, Antoine Boutet, Louafi Bouzouina, Patrick Bonnel, Olivier Brette, Lionel Brunie, Mathieu Cunche, Stephane D'Alu, Vincent Primault, Patrice Raveneau, et al. Priva'Mov: Analysing human mobility through multi-sensor datasets. In NetMob 2017, 2017.
[25] Thomas Nagler and Thibault Vatter. rvinecopulib: high performance algorithms for vine copula modeling, 2018. URL https://cran.r-project.org/package=rvinecopulib.
[26] Mario A. Nascimento, Dieter Pfoser, and Yannis Theodoridis. Synthetic and real spatiotemporal datasets. IEEE Data Eng. Bull., 26(2):26–32, 2003.
[27] Milad Nasr, Reza Shokri, and Amir Houmansadr. Machine learning with membership privacy using adversarial regularization. arXiv preprint arXiv:1807.05852, 2018.
[28] Kun Ouyang, Reza Shokri, David S Rosenblum, and Wenzhuo Yang. A non-parametric generative model for human trajectories. In IJCAI, pages 3812–3817, 2018.
[29] Nikos Pelekis, Stylianos Sideridis, Panagiotis Tampakis, and Yannis Theodoridis. Hermoupolis: a semantic trajectory generator in the data science era. SIGSPATIAL Special, 7(1):19–26, 2015.
[30] Apostolos Pyrgelis, Carmela Troncoso, and Emiliano De Cristofaro. Knock knock, who's there? Membership inference on aggregate location data. arXiv preprint arXiv:1708.06145, 2017.
[31] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[32] Reza Shokri, George Theodorakopoulos, Jean-Yves Le Boudec, and Jean-Pierre Hubaux. Quantifying location privacy. In 2011 IEEE Symposium on Security and Privacy, pages 247–262. IEEE, 2011.
[33] Abe Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de L'Institut de Statistique de L'Université de Paris, 8:229–231, 1959.
[34] Yannis Theodoridis, Jefferson RO Silva, and Mario A Nascimento. On the generation of spatiotemporal datasets. In International Symposium on Spatial Databases, pages 147–164. Springer, 1999.
[35] Fengli Xu, Zhen Tu, Yong Li, Pengyu Zhang, Xiaoming Fu, and Depeng Jin. Trajectory recovery from ash: User privacy is not preserved in aggregated mobility data. In Proceedings of the 26th International Conference on World Wide Web, pages 1241–1250. International World Wide Web Conferences Steering Committee, 2017.
[36] Jianqiu Xu and Ralf Hartmut Güting. MWGen: a mini world generator. In Mobile Data Management (MDM), 2012 IEEE 13th International Conference on, pages 258–267. IEEE, 2012.
[37] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
[38] Zhi-Dan Zhao, Shi-Min Cai, and Yang Lu. Non-Markovian character in human mobility: Online and offline. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25(6):063106, 2015.
[39] Yu Zheng, Xing Xie, and Wei-Ying Ma. GeoLife: A collaborative social networking service among user, location and trajectory. IEEE Data Eng. Bull., 33:32–39, 2010.
[40] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. In ICML, 2017.