Interdependence and Predictability
of Human Mobility and Social Interactions
Previous studies have shown that human movement is predictable to a certain extent at different geographic scales. Existing prediction techniques exploit only the past history of the person taken into consideration as input of the predictors.
In this paper, we show that by means of multivariate nonlinear time series prediction techniques it is possible to increase the forecasting accuracy by considering movements of friends, people, or more in general entities, with correlated mobility patterns (i.e., characterised by high mutual information) as inputs. Finally, we evaluate the proposed techniques on the Nokia Mobile Data Challenge and Cabspotting datasets.
Keywords: mobility prediction, mutual information, nonlinear time series analysis
The study of the interdependence of human movement and social ties of individuals is one of the most interesting research areas in computational social science computationalSocialScience09. Previous studies have shown that human movement is predictable to a certain extent at different geographic scales BHG:Scaling; SMML11:nextplace; Song:Limits. The potential uses of these prediction techniques are various, including practical ones, such as content dissemination of location-aware information, e.g., targeted advertisements in sponsored mobile applications or in search results performed from mobile phones LGKO04:Bluetooth.
In this paper we show how it is possible to improve mobility prediction by exploiting the correlation between movements of individuals. Such correlations can be exploited for prediction and inference of aspects related to user behaviour, namely their movements and their social interactions (either physical or distant). In particular, in our analysis we exploit and adapt the concept of mutual information infotheory in order to quantify correlation and provide a practical method for the selection of additional data to improve the accuracy of movement forecasting. We also show how this quantity correlates with different types of social interactions of friends and acquaintances. This paper extends the findings presented in our submission DMM12:Interdependence to the Nokia Mobile Data Challenge competition NokiaMDC.
More specifically, the contributions of this work can be summarised as follows:
We first show that by means of a multivariate nonlinear predictor kantz1997nonlinear we are able to achieve a very high degree of accuracy in forecasting future user geographic locations in terms of longitude and latitude. We compare it with traditional linear prediction techniques (such as ARMA Cha04) and we show that these are not able to capture the dynamics of individuals in the geographic space.
We discuss how the concept of mutual information can be used to quantify the correlation between two mobility traces and we demonstrate that it is possible to exploit movement data of friends and acquaintances, when such information is available.
Finally, we study how the correlation between the mobility traces of two individuals, measured through mutual information, can be used to improve human movement prediction dramatically; we also discuss the correlation between human mobility and social ties.
The key findings of our analysis are the following: 1) mobility correlation can be used to improve movement forecasting by exploiting mobility data of friends; 2) correlated movement is linked to the existence of physical or distant social interactions and vice versa.
We evaluate these findings on two datasets. The first dataset, which was provided for the Nokia Mobile Data Challenge (NMDC), contains information related to 39 users NokiaMDC, including the following: GPS traces, telephone numbers, call and SMS history, Bluetooth and WLAN history. We use the information of 25 of them, since the dataset does not include phone numbers for the other 14; for those users it is therefore not possible to detect if and when phone calls occur between them. We use GPS traces to analyse the movement of the users.
The second dataset we analyse is Cabspotting cabspotting_dataset, containing mobility traces of about 500 taxis driving around San Francisco for 30 days. We restrict our analysis to the 178 taxis with mobility traces longer than 25000 GPS readings. For this dataset we have no information about relations between taxi drivers (such as friendship connections or co-affiliation).
The paper is organised as follows. In Section 2 we first introduce multivariate nonlinear time series prediction techniques and their application to our datasets. Then, in Section 3 we discuss how mutual information can be used to measure the correlation between the movement of two users. Section 4 focusses on the analysis of the performance improvement that it is possible to obtain by considering the traces of highly correlated users as inputs of the predictors. In Section 5, we discuss our findings and outline some future directions. Section 6 concludes the paper, summarising our contributions.
2 Multivariate Nonlinear Time Series Prediction
We now present how we apply nonlinear time series prediction methods to the problem of forecasting the future GPS coordinates of the users, given the past movement history as an input. We will then extend this model by considering also the movement of other users (in particular friends, in the case of the NMDC dataset) as input of the nonlinear predictor.
We model the position of a user on the Earth with a time-varying four-dimensional state vector with the following dimensions: hour of the day, latitude, longitude and altitude (the altitude time series is available only in the NMDC dataset; in the case of Cabspotting data we use a time-varying three-dimensional state vector). The prediction of the future states of this vector can be performed using different approaches kantz1997nonlinear. We choose the method based on the reconstruction of the phase space by means of the delay embedding theorem, since this is considered the best state-of-the-art solution to this problem. While the scalar sequence of coordinates may appear completely non-deterministic, it is possible to uncover the characteristics of its dynamic evolution by analysing sub-sequences of the time series itself. In order to investigate the structure of the original system, the time series values must be transformed into a sequence of vectors with a technique called delay embedding. For a univariate time series measurement $s_n$ of a $D$-dimensional dynamical system, Takens’ embedding theorem takens1981detecting allows one to reconstruct an $m$-dimensional space ($m \geq 2D + 1$) with the same dynamical characteristics as the original phase space. The key idea is to build a delay vector $\mathbf{s}_n$ by using delayed measurements, defined as follows:

$$\mathbf{s}_n = \left(s_{n-(m-1)\tau},\, s_{n-(m-2)\tau},\, \ldots,\, s_{n-\tau},\, s_n\right) \tag{1}$$

where $\tau$ is a time delay. Hence, the reconstruction depends on the two parameters $m$ and $\tau$, which have to be estimated. This technique can be extended to the case of the embedding of a multivariate time series cao1998dynamics (we refer to vlachos2009state, and references therein, for an overview of practical applications of multivariate embedding).
Under the hypotheses of Takens’ theorem, i.e., non-noisy time series of infinite length, the underlying dynamics can be fully reconstructed by using only univariate measurements of the dynamical system of interest. Unfortunately, real-world measurements are noisy and of finite length: hence, the phase space reconstruction is more precise if multivariate measurements of the dynamical system under investigation are performed.
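Let us indicate with $N$ the number of measurements corresponding to an $M$-dimensional time series $\mathbf{y}_1$, $\mathbf{y}_2$, $\ldots$, $\mathbf{y}_N$, with $\mathbf{y}_n = (y_{1,n}, y_{2,n}, \ldots, y_{M,n})$ and $n = 1, 2, \ldots, N$. The resulting delay vector is

$$\mathbf{v}_n = \left(y_{1,n-(m_1-1)\tau_1}, \ldots, y_{1,n-\tau_1}, y_{1,n},\; \ldots,\; y_{M,n-(m_M-1)\tau_M}, \ldots, y_{M,n-\tau_M}, y_{M,n}\right) \tag{2}$$

where $m_j$ and $\tau_j$, $j = 1, 2, \ldots, M$, are respectively the embedding dimensions and time delays corresponding to each component of the multivariate time series.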
Intuitively, this method searches the past history to find and extract sequences of values that are very similar to the recent history. Assuming a certain degree of determinism in the system, a given state (in our case, geographic coordinates) is then likely to be followed by the same next state.
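The delay-vector construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `delay_embed` and its arguments are hypothetical, and it assumes a regularly sampled series stored as an array with one row per measurement and one column per component.

```python
import numpy as np

def delay_embed(series, dims, delays):
    """Build multivariate delay vectors (illustrative sketch).

    series : array of shape (N, M), one column per component
    dims   : embedding dimension for each component
    delays : time delay (in samples, >= 1) for each component
    Returns (vectors, start), where vectors[i] is the delay vector
    at time index start + i.
    """
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]
    n_samples = series.shape[0]
    # First time index for which every delayed sample exists.
    start = max((m - 1) * tau for m, tau in zip(dims, delays))
    columns = []
    for j, (m, tau) in enumerate(zip(dims, delays)):
        # Oldest lag first: n-(m-1)tau, ..., n-tau, n.
        for lag in range((m - 1) * tau, -1, -tau):
            columns.append(series[start - lag : n_samples - lag, j])
    return np.column_stack(columns), start
```

For example, `delay_embed(np.arange(10), [3], [2])` yields vectors of three samples spaced two steps apart, the first of which is available at time index 4.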
2.2.1 Linearity Analysis
The complexity of the time series taken into consideration in our study is apparent by observing the two representative examples shown in Fig. 1 and Fig. 2. The figures show thousands of time-ordered GPS measurements corresponding to the position on the Earth of user 179 (NMDC dataset) and taxi abgibo (Cabspotting dataset), respectively.
We first apply linear prediction models to these time series. The time series appear rather noisy, with alternating spikes, nearly flat values (corresponding to stationary points) and fluctuations around an average value. We try to model such movements in space with a simple multivariate AR + noise process.
As for the order of the multivariate autoregressive model that best approximates the original time series, we choose the one that minimises an information criterion, according to Akaike akaike1974new and Schwarz schwarz1978estimating. We find that provides the best approximation. Hence, we use such a model to perform a multivariate linear forecasting of 1000 GPS measurements for user 179 (NMDC dataset). We validate the model by comparing the latest 1000 real GPS measurements against the forecasted ones (the latest 1000 real GPS measurements have not been included in the procedure adopted to estimate the best order). The results are shown in Fig. 3, where the real movements are indicated with dots and the forecasting with the linear model is indicated by the solid line. It is evident that the forecasting is not in agreement with observations. In fact, the prediction error on the position (latitude and longitude) is of the order of , whereas the error on the altitude is generally larger than 600 m.
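As an illustration of order selection by an information criterion, the sketch below fits a vector autoregressive model by ordinary least squares and scores it with one common form of the Akaike criterion (log-determinant of the residual covariance plus a penalty on the number of lag coefficients). The function names are hypothetical and the exact criterion and fitting procedure used in our analysis may differ in detail.

```python
import numpy as np

def fit_var(data, order):
    """Least-squares fit of y_t = c + sum_l A_l y_{t-l} + noise.

    Returns (coef, sigma): coef has shape (1 + k*order, k), with the
    intercept in row 0; sigma is the residual covariance matrix.
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    rows = n - order
    design = np.ones((rows, 1 + k * order))
    for lag in range(1, order + 1):
        design[:, 1 + (lag - 1) * k : 1 + lag * k] = data[order - lag : n - lag]
    target = data[order:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return coef, resid.T @ resid / rows

def aic(data, order):
    """Akaike-type score: ln det(sigma) + 2 * (number of lag terms) / T."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    _, sigma = fit_var(data, order)
    _, logdet = np.linalg.slogdet(sigma)
    return logdet + 2.0 * order * k * k / (n - order)

def select_order(data, max_order):
    """Return the order in 1..max_order with the smallest score."""
    scores = {p: aic(data, p) for p in range(1, max_order + 1)}
    return min(scores, key=scores.get), scores
```

Replacing the penalty `2.0` with `np.log(n - order)` gives a Schwarz-type criterion under the same assumptions.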
However, although the time series are not regularly sampled, we find that they show some features typical of deterministic dynamics contaminated by noise. In fact, preliminary inspection of phase space reconstruction by means of Takens’ embedding theorem shows an underlying structure, typical of deterministic dynamical systems. This aspect will be addressed more quantitatively in the remainder of the paper.
2.2.2 Estimation of the Embedding Dimension and Time Delay
Although several methods have been proposed to estimate the values of the embedding dimension and time delay, in our analysis we consider the same time delay $\tau$ for all the series. In fact, for a given user, we have found that the time delay corresponding to the first local minimum of the average mutual information fraser1986independent, generally adopted to estimate $\tau$ in the univariate case, is of the same order of magnitude for any component. As a representative example, in Fig. 4 we show the distribution of $\tau$ obtained from the time series of latitude and longitude of taxis in the Cabspotting dataset.
This fact also has practical implications, since it simplifies the application of this methodology for the analysis of our data. The optimal embedding dimension $m$ is estimated by exploiting the method of false nearest neighbours kennel1992determining; kantz1997nonlinear; hegger1999improved in the case of multivariate embedding boccaletti2002reconstructing. For any point in the data, an $m$-dimensional phase space is considered and the number of false nearest neighbours, i.e., points which are neighbours in the $m$-dimensional space but not in the $(m+1)$-dimensional one, is estimated. The desirable embedding dimension is such that the percentage of false nearest neighbours is small, e.g., below 5%. Any efficient algorithm for counting nearest neighbours is allowed: in particular, we adopt the method implemented in the TISEAN software hegger1999practical. In the left panel of Fig. 5 we show the fraction of false nearest neighbours as a function of $m$, obtained from mobility traces in the Cabspotting dataset. For any trace, the optimal embedding dimension is close to 30, confirming that the underlying dynamics is very similar (we find only a few exceptions, whose number represents less than 5% of the mobility traces in the whole dataset). The false nearest neighbour method alone is not able to distinguish between deterministic and stochastic processes on an absolute level hegger1999improved: however, it is among the state-of-the-art solutions Farmer87; Sugihara90; Barahona96; Schmitz97; dedomenico2010fast that can be reliably used to assess the nonlinearity of time series by means of a statistical test with surrogate data.
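To make the false nearest neighbour criterion concrete, the following sketch computes the fraction of false neighbours for a univariate series with a brute-force neighbour search. It is a simplified illustration rather than the TISEAN routine used in our analysis: the function name, the distance-ratio threshold of 10 and the univariate restriction are assumptions made for the example.

```python
import numpy as np

def fnn_fraction(x, m, tau, ratio=10.0):
    """Fraction of false nearest neighbours at embedding dimension m.

    A neighbour found in m dimensions is counted as false when adding
    the (m+1)-th delay coordinate increases the separation by more
    than `ratio` times the m-dimensional distance.
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - m * tau  # points that exist in both m and m+1 dimensions
    emb = np.column_stack([x[k * tau : k * tau + n] for k in range(m)])
    extra = x[m * tau : m * tau + n]  # the additional coordinate
    false_count = 0
    counted = 0
    for i in range(n):
        dist = np.linalg.norm(emb - emb[i], axis=1)
        dist[i] = np.inf  # exclude the point itself
        j = int(np.argmin(dist))
        if dist[j] == 0.0:
            continue  # identical points carry no information
        counted += 1
        if abs(extra[i] - extra[j]) / dist[j] > ratio:
            false_count += 1
    return false_count / counted if counted else 0.0
```

Scanning `m` upwards and stopping when the returned fraction drops below a small value such as 5% mirrors the selection rule described above.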
2.2.3 Analysis of Multivariate Surrogates
Given a multivariate time series $\mathbf{y}$, we produce a set $\tilde{\mathbf{y}}^{(k)}$, $k = 1, 2, \ldots, K$, of multivariate surrogates of $\mathbf{y}$. The surrogates are synthetic time series, built from $\mathbf{y}$, preserving both statistical and linear features of the original time series, such as probability distribution and autocorrelation, while removing the effects of nonlinearities and nonstationarities, if any. In particular, we adopt the iterative amplitude-adjusting Fourier transform (IAAFT) scheme Schr-Schm96; Schreiber98 to build surrogates. We then choose the fraction of false nearest neighbours as discriminator to test the null hypothesis that the mobility traces can be described by a linear stochastic model. Let us indicate with $\chi_0(m)$ the value of the discriminator obtained for an embedding dimension $m$ from the observed multivariate time series, and with $\chi_k(m)$ the values of the discriminator obtained from surrogates. Our numerical experiments indicate that the distribution of $\chi_k(m)$ is described with a reasonable approximation by a Gaussian function with average $\mu(m)$ and variance $\sigma^2(m)$. This fact allows us to define the quantity

$$S(m) = \frac{\left|\chi_0(m) - \mu(m)\right|}{\sigma(m)} \tag{3}$$

as a measure of significance. In this case, if the null hypothesis is true then the $p$-value of observing a significance equal to or larger than $S$ is given by $p = \mathrm{erfc}\left(S/\sqrt{2}\right)$. We fix a priori the size of our hypothesis test to $\alpha = 0.05$: if $p < \alpha$ (or, equivalently, if $S > 1.96$) the null hypothesis that mobility can be described by a linear stochastic model is rejected with 95% confidence level (CL). In the right panel of Fig. 5 we show the significance as a function of $m$ for mobility traces in the Cabspotting dataset. Remarkably, the significance is much larger than 1.96 for all traces apart from a few exceptions, independently of the embedding dimension chosen for the reconstruction. Hence, we can conclude that human mobility exhibits strongly nonlinear dynamics. Moreover, the existence of short-term correlations, as indicated by the average mutual information analysis, and the decreasing fraction of false nearest neighbours for increasing embedding dimension suggest that such dynamics should have a deterministic component potentially contaminated by a stochastic one.
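The surrogate test can be illustrated with the sketch below. It shows a univariate IAAFT surrogate generator and the computation of the significance and its Gaussian two-sided p-value from a set of surrogate discriminator values. This is a simplified illustration under stated assumptions: our analysis uses multivariate surrogates, and the function names, the number of iterations and the use of the sample standard deviation are choices made for the example.

```python
import math
import numpy as np

def iaaft_surrogate(x, n_iter=100, rng=None):
    """Univariate IAAFT surrogate: keeps the value distribution exactly
    and approximately preserves the amplitude spectrum of x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    sorted_x = np.sort(x)
    target_amp = np.abs(np.fft.rfft(x))
    s = rng.permutation(x)
    for _ in range(n_iter):
        # Impose the target amplitude spectrum, keeping current phases.
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_amp * np.exp(1j * phases), n=len(x))
        # Restore the original values by rank ordering.
        ranks = np.argsort(np.argsort(s))
        s = sorted_x[ranks]
    return s

def surrogate_significance(observed, surrogate_values):
    """Significance S = |observed - mean| / std and its two-sided
    Gaussian p-value erfc(S / sqrt(2))."""
    values = np.asarray(surrogate_values, dtype=float)
    mu = values.mean()
    sigma = values.std(ddof=1)
    significance = abs(observed - mu) / sigma
    p_value = math.erfc(significance / math.sqrt(2.0))
    return significance, p_value
```

In use, the discriminator (here, the fraction of false nearest neighbours) is evaluated on the observed series and on each surrogate, and the null hypothesis is rejected when the returned p-value is below 0.05.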
2.2.4 Analysis of Prediction Errors
Since we are dealing with nonlinear dynamical systems with a potential deterministic component, we adopt a method which exploits such features to predict the future movements of users in the NMDC dataset and of taxis in the Cabspotting dataset. The multivariate nonlinear prediction (MNP) is performed by approximating the dynamics locally in the phase space by a constant (see casdagli1989nonlinear for further information). In the delay embedding space, all the points in the neighbourhood $\mathcal{U}_n$ of the state $\mathbf{v}_n$ are taken into account in order to predict the coordinates at time $n + k$. Hence, the forecast for $\mathbf{y}_{n+k}$ is given by

$$\hat{\mathbf{y}}_{n+k} = \frac{1}{\left|\mathcal{U}_n\right|} \sum_{\mathbf{v}_j \in \mathcal{U}_n} \mathbf{y}_{j+k} \tag{4}$$

i.e., the average over the states which correspond to measurements $k$ steps ahead of the neighbours $\mathbf{v}_j$.
Hence, we use MNP to forecast the same 1000 GPS measurements previously discussed in the case of NMDC dataset. Again, we validate the model by comparing the latest 1000 real GPS measurements against the forecasted ones. The results for user 179 are shown in Fig. 3, where the real movements are indicated with triangles and the forecasting with the nonlinear method is indicated by the dashed line. The number of nearest neighbours used to build the neighbourhood has been kept fixed to 10. Intriguingly, the nonlinear forecasting is in excellent agreement with observations of latitude and longitude, with a global position prediction error equal to , and in good agreement with the altitude coordinate, with a global altitude forecasting error equal to 219.43 m.
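The locally constant predictor can be sketched as follows. This is an illustrative version with a brute-force search over a fixed number of nearest neighbours; the function name and argument layout are hypothetical, and the delay vectors are assumed to have been built beforehand, with `embedded[i]` and `targets[i]` referring to the same time index.

```python
import numpy as np

def local_constant_forecast(embedded, targets, query, horizon=1, k=10):
    """Average the states observed `horizon` steps after the k delay
    vectors that are closest to `query`.

    embedded : array (T, d), delay vector at each time index
    targets  : array (T,) or (T, c), state at the same time index
    query    : array (d,), delay vector of the present state
    """
    embedded = np.asarray(embedded, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Only neighbours whose future state is already known can be used.
    usable = len(embedded) - horizon
    dist = np.linalg.norm(embedded[:usable] - query, axis=1)
    nearest = np.argsort(dist)[:k]
    return targets[nearest + horizon].mean(axis=0)
```

With `query` set to the most recent delay vector, the returned value is the forecast `horizon` steps ahead; since the last `horizon` vectors are excluded from the search, the query itself is never selected as its own neighbour when it is the final row of `embedded`.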
The global error on the time series prediction is estimated separately for each component using the following formula:
with with , . The overall error between the predicted position and the real one is given by the geodesic distance.
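For reference, the distance between a predicted and an observed position can be computed on a spherical Earth with the haversine formula, as in the sketch below. The spherical approximation, the mean radius of 6371 km and the function name are assumptions made for illustration; a more accurate ellipsoidal geodesic could be substituted.

```python
import math

def geodesic_distance_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two points given in
    decimal degrees (haversine formula, spherical Earth assumed)."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2)
    return 2.0 * radius_m * math.asin(math.sqrt(a))
```

Under these assumptions, one degree of longitude along the equator corresponds to roughly 111 km.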
3 Mutual Information and Movement Correlation
In this section, we will briefly introduce the concept of mutual information and we will show how this quantity can be exploited in our analysis to measure the correlation between the movement of different individuals. In the following section, we will then discuss how mutual information can be used to select mobility data of other users that can be exploited as inputs of the nonlinear predictors in order to improve the prediction accuracy.
Let us assume that $X$ and $Y$ are two multivariate stochastic variables, and let us indicate with $p(x)$ and $p(y)$, respectively, the corresponding Probability Density Functions (PDF). The joint probability is indicated by $p(x, y)$. The mutual information between such two variables is defined as follows:

$$I(X; Y) = \iint p(x, y) \ln \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy \tag{6}$$

The mutual information (measured in nats when the natural logarithm is used) quantifies how much information the variable $X$ provides about the variable $Y$. For this reason, it can be used as an estimator of the amount of correlation between $X$ and $Y$. In fact, if the two variables are totally uncorrelated then $p(x, y) = p(x)\, p(y)$ and $I(X; Y) = 0$.
In our analysis $X$ represents the motion of a user on the Earth, the random samples drawn from $X$ correspond to geographic coordinates, whereas the PDF of $X$ quantifies the fraction of time spent by the user in a particular position.
We use the mutual information to quantify the amount of correlation between the motion of different users, or, equivalently, how much information the motion $X$ provides about the motion $Y$.
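A simple plug-in estimate of the mutual information between two time-aligned mobility traces can be obtained by discretising positions into grid cells and counting joint occurrences, as sketched below. The grid size of 0.01 degrees, the function names and the histogram estimator itself are assumptions made for illustration; they are not necessarily the estimator used in our analysis.

```python
import numpy as np

def grid_cells(lat, lon, cell_deg=0.01):
    """Map coordinates to integer cell labels on a regular grid."""
    rows = np.floor(np.asarray(lat, dtype=float) / cell_deg).astype(np.int64)
    cols = np.floor(np.asarray(lon, dtype=float) / cell_deg).astype(np.int64)
    _, r = np.unique(rows, return_inverse=True)
    _, c = np.unique(cols, return_inverse=True)
    r, c = r.ravel(), c.ravel()
    return r * (c.max() + 1) + c

def mutual_information(labels_x, labels_y):
    """Plug-in mutual information (in nats) of two label sequences
    sampled at the same time instants."""
    _, xi = np.unique(labels_x, return_inverse=True)
    _, yi = np.unique(labels_y, return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)  # joint occurrence counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))
```

Two users whose cell sequences always coincide give a value equal to the entropy of the shared sequence, while independent sequences give a value close to zero (exactly zero only when the empirical joint distribution factorises).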
In the NMDC dataset, we say that two individuals are friends or acquaintances if one of them is in the other’s address book. In Fig. 6 the two-dimensional PDF of positions occupied by four different users is shown. Users 063 and 123 are friends or acquaintances, while users 026 and 127 are not.
4 Exploiting Movement Correlation and Social Ties to Improve Prediction Accuracy
We now discuss how mobility traces of individuals that have correlated geographic patterns and social ties can be used to improve the accuracy of movement forecasting.
4.1 Our Approach at a Glance
Our approach can be summarised as follows: assuming that we want to predict the movement of person/entity $A$, instead of having only the vector describing the location of $A$ as input, we will also consider the movement history of another person/entity $B$, characterised by mobility patterns that are strongly correlated with those of the user we would like to predict. The correlation is measured by the mutual information introduced in the previous section.
From a mathematical point of view, the idea is to use an 8-dimensional vector that is given by the juxtaposition of the two time-varying state vectors representing the states (time-stamped GPS coordinates) of $A$ and $B$, which we indicate with $\mathbf{s}_A$ and $\mathbf{s}_B$, as inputs of the multivariate nonlinear predictor.
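The juxtaposition step can be illustrated with a short sketch that picks the candidate trace with the highest mutual information and concatenates its state vectors with those of the target user. The function name and the dictionary-based inputs are hypothetical, and the traces are assumed to be already aligned on the same time instants.

```python
import numpy as np

def combine_with_best_partner(target, candidates, mi_scores):
    """Return the name of the most correlated candidate and the
    juxtaposed state vectors.

    target     : array (T, d), states of the user to predict
    candidates : dict name -> array (T, d), states of other users
    mi_scores  : dict name -> mutual information with the target
    """
    best = max(mi_scores, key=mi_scores.get)
    combined = np.hstack([np.asarray(target), np.asarray(candidates[best])])
    return best, combined
```

With two four-dimensional state vectors, the combined array has eight columns and can be passed to the delay embedding in place of the single-user series.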
In both datasets we find that by using additional traces of pairs with high correlation, the accuracy of the prediction improves consistently. In the case of NMDC, the improvement is of at least one order of magnitude (and often of two orders of magnitude) with respect to the prediction based on only single traces. Moreover, it is interesting to note that social ties can also be used to select the user for the additional traces as input. In fact, we find that if we select mobility patterns of individuals that are in the address book of the user, the performance of the predictor improves dramatically. At the same time, we would like to stress the fact that the NMDC dataset contains a small number of users, therefore it is difficult to make claims about the general validity of these findings. However, we find the same results for the much larger Cabspotting dataset. In this dataset, it is not possible to use social ties (in theory, it might be interesting to investigate the influence of the social ties between taxi drivers, but this information is unfortunately not available in the dataset), but we find that if we select mobility patterns of taxis whose mutual information is high, the performance of the predictor improves drastically.
Hence, we perform the same analysis described in Section 2, but including the time series of movements corresponding to other users in the multivariate nonlinear prediction. The global prediction error, defined by Eq. (5), of position and altitude is reported in Tab. 1 for three pairs in the NMDC dataset. As shown in this table, we observe that the additional information provided by the movement of a user socially linked to that taken into consideration improves the prediction by more than one order of magnitude with respect to the case of users who are not socially linked to each other.
| Nodes | Social link | Position Error | Altitude Error |
| 026, 127 | None | | 66.33 m |
| 063, 123 | Present | | 20.95 m |
| 094, 009 | Present | | 5.57 m |
For each pair of users in the NMDC dataset, we count the total number of Bluetooth contacts and calls, then we estimate their mutual information defined by Eq. (6). In order to quantify the amount of correlation between the fraction of contacts and the mutual information, we build a scatter plot of these two observables. The result is shown in the left panel of Fig. 7, considering only pairs of users with at least one contact. The points corresponding to pairs of users with social ties are also shown (triangles). In the right panel of Fig. 7, we show the PDF of mutual information obtained by considering only pairs of users with no contacts at all. The mutual information corresponding to pairs of users with social ties is shown (arrows). Even if these plots show interesting correlations for this specific dataset, we believe no generalisations can be drawn from them, because of the lack of sufficient statistics.
Hence, we perform the same analysis by exploiting the mobility patterns of taxis in the Cabspotting dataset, which contains a larger statistical sample. In this case, the global prediction error, defined by Eq. (5), refers only to latitude and longitude. Moreover, we investigate the evolution of the global prediction error by estimating how it changes versus time. More specifically, we define the time-varying global prediction error for each component as
with and indicating the prediction interval. Hence, the overall error between the predicted position and the real one at time is given by . In order to investigate the quality of our prediction, we study the ratio of this overall error to the global statistical uncertainty on the position of the taxi. In fact, as long as the ratio is equal to or smaller than one, the prediction is within the statistical uncertainty and, therefore, the performance of our predictor can be considered satisfactory. In Fig. 8 we show the cumulative distribution of the values of the ratio obtained from mobility traces in the Cabspotting dataset. In particular, we show the distributions corresponding to the predicted positions after 5 minutes and 30 minutes. The three curves correspond to prediction involving: a) only the past history of each single taxi (“Single”), b) the history of any pair of taxis whose mobility patterns show a low mutual information (“Combined, Low MI”) and c) the history of any pair of taxis whose mobility patterns show a high mutual information (“Combined, High MI”). It is worth remarking that the mutual information is not an upper-bounded measure of correlation: hence, we define as “High MI” all pairs of mobility patterns whose mutual information lies among the highest 5% of values, and as “Low MI” the remaining pairs of mobility patterns. In both panels of Fig. 8, we can observe that the prediction improves when combining pairs of correlated mobility patterns. Moreover, it is intriguing that our method is able to predict the movements of taxis for the next 30 minutes in 80% of cases, with an error equal to or smaller than the statistical uncertainty of their mobility patterns.
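The selection of “High MI” pairs and the acceptance criterion on the error ratio can be sketched as follows. The function names are hypothetical, and the use of an empirical quantile to locate the highest 5% of mutual information values is an assumption about how the threshold might be computed.

```python
import numpy as np

def split_by_mutual_information(mi_values, top_fraction=0.05):
    """Return a boolean mask marking the pairs whose mutual information
    is among the highest `top_fraction` of values, and the threshold."""
    mi = np.asarray(mi_values, dtype=float)
    threshold = np.quantile(mi, 1.0 - top_fraction)
    return mi >= threshold, threshold

def fraction_within_uncertainty(errors, uncertainties):
    """Fraction of traces whose prediction error does not exceed the
    statistical uncertainty, i.e. whose ratio is at most one."""
    ratio = np.asarray(errors, dtype=float) / np.asarray(uncertainties, dtype=float)
    return float(np.mean(ratio <= 1.0))
```

Evaluating `fraction_within_uncertainty` at increasing prediction intervals gives the kind of curve used to compare the “Single”, “Combined, Low MI” and “Combined, High MI” predictors.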
Since the prediction is acceptable when the ratio is below one, we investigate how the fraction of mobility traces satisfying this requirement changes over time. In Fig. 9 we show this temporal evolution, with 90% confidence bands around the average values. Due to the lack of statistics (137 mobility traces) in the “Single” traces prediction, the bands are wider than for the other cases (9316 mobility traces). All curves show decreasing behaviour for increasing prediction interval, as expected. In fact, the “Combined, High MI” predictor performs as well as or better than the others up to about 90 minutes. It is worth mentioning that the forecasting of every predictor is within the statistical uncertainty for more than 50% of the mobility traces considered, up to 3 hours.
In the context of mobile applications, the prediction of mobility patterns of users is of great interest for several reasons. For instance, mobility forecasting could be used to determine where the person will be and who he/she will meet. Such information can enable location-based mobile applications to provide personalised services relating to the context the user is in.
However, we are aware that there are scalability issues related to the implementation and the deployment of the proposed technique. In particular, it is well known that calculating mutual information in a multidimensional environment (in this case, for a number of users larger than two) is computationally expensive and does not scale efficiently. In fact, in this case the computational complexity scales as , where is the subset of users and is the cardinality of the tuple taken into account. However, the problem we are dealing with usually involves no more than 100 mobility traces (e.g., the size of the circle of most significant friends for an individual). For this reason, we can still evaluate mutual information values for any pair of traces, which scales as . Nonetheless, the multivariate embedding reconstruction is not feasible for a phase-space larger than 40-dimensional. Even for a 2-coordinate signal representing a mobility trace, it is not unusual to have a large embedding reconstruction due to noisy data. Hence, no more than three mobility traces should be considered simultaneously. Moreover, we are aware that the algorithm for searching the most suitable additional set of mobility traces for improving the prediction scales as , where is the number of users taken into consideration in the application.
It is worth noting that many factors could be considered as signals of social ties, according to the context of the deployment scenarios. As a consequence, the quality of predictions might be deeply affected, either positively or negatively, by the criteria used to detect social ties. In the Nokia MDC dataset, we had no information about social ties between individuals, neither of real nor virtual nature. In the available dataset, the presence of an individual in the address book of another one actually represents the strongest definition of a social tie. Moreover, two individuals with no social ties might show similar mobility patterns, resulting in a high value of mutual information. It is likely that individuals with strong social ties (students, friends, co-workers and so on) behave similarly and their mobility traces are characterised by patterns with a high value of mutual information. Hence, the accuracy of the predictor will be improved in the case the dynamics of traces is highly correlated, even if a social tie does not exist.
A possible refinement of this work is the use of multivariate nonlinear prediction with non-uniform embedding (different delays) and local polynomial fitting su2010prediction in order to increase the accuracy of the prediction.
In this paper, we have shown that multivariate nonlinear time series techniques can be successfully applied to improve the prediction of movement of users, by considering the movement of people with correlated mobility patterns. More specifically, through the analysis of the Nokia Mobile Data Challenge traces, we have shown that it is possible to exploit the correlation of social interactions and user movement in order to improve the accuracy of forecasting of the future geographic position of a user. By means of the Cabspotting dataset we have also shown that when information about social ties is not available, mutual information can be used to select pairs of users in order to improve prediction accuracy.
In other words, mobility correlation, measured by means of mutual information, and the presence of social ties can be used to improve movement forecasting by exploiting mobility data of other individuals. This correlation can be used as an indicator of the potential existence of physical or distant social interactions and vice versa.