# Cluster-Aided Mobility Predictions

## Abstract

Predicting the future location of users in wireless networks has numerous applications, and can help service providers improve the quality of service perceived by their clients. The location predictors proposed so far estimate the next location of a specific user by inspecting the past individual trajectories of this user. As a consequence, when the training data collected for a given user is limited, the resulting prediction is inaccurate. In this paper, we develop cluster-aided predictors that exploit past trajectories collected from all users to predict the next location of a given user. These predictors rely on clustering techniques and extract from the training data similarities among the mobility patterns of the various users to improve the prediction accuracy. Specifically, we present CAMP (Cluster-Aided Mobility Predictor), a cluster-aided predictor whose design is based on recent non-parametric Bayesian statistical tools. CAMP is robust and adaptive in the sense that it exploits similarities in users’ mobility only if such similarities are actually present in the training data. We analytically prove the consistency of the predictions provided by CAMP, and investigate its performance using two large-scale datasets. CAMP significantly outperforms existing predictors, and in particular those that only exploit individual past trajectories.

## 1 Introduction

Predicting users’ mobility in wireless networks has received a great deal of attention recently, strongly motivated by a wide range of applications. Examples of such applications include: location-based services provided to users by anticipating their movements (e.g., mobile advertisement, recommendation systems, risk alarm); urban traffic engineering and forecasting; the design of more efficient radio resource allocation protocols (e.g., scheduling and handover management [1], data prefetching [2] and energy efficient location sensing [3]). However, for these applications to significantly benefit from users’ mobility predictions, the latter should be made with a sufficiently high degree of accuracy.

Many mobility prediction methods and algorithms have been devised over the last decade, see e.g. [4, 5, 3, 6]. The algorithms proposed so far estimate the next location of a specific user by inspecting the data available about her past mobility, i.e., her past trajectory, and exploit the inherent repeated patterns present in this data. These patterns correspond to the regular behavior of the user, e.g. commuting from home to work or visiting favourite restaurants, and need to be extracted from the data to provide accurate predictions. To this aim, one has to observe the behavior of the user over long periods of time. Unfortunately, gathering data about users’ mobility can be quite challenging. For instance, detecting the current location of a user with sensors (e.g., GPS, Wi-Fi and cell tower) consumes a non-negligible amount of energy. Users may also hesitate to log their trajectories in order to preserve their privacy. In any case, when the data about the mobility of a given user is limited, it is hard to identify her typical mobility patterns, and in turn difficult to provide accurate predictions of her next move or location.

In this paper, we aim at devising mobility predictors that perform well even if the past trajectories gathered for the various users are short. Our main idea is to develop cluster-aided predictors that exploit the data (i.e., past trajectories) collected from all users to predict the next location of a given user. These predictors rely on clustering techniques and extract from the training data similarities among the mobility patterns of the various users to improve the prediction accuracy. More precisely, we make the following contributions:

• We present CAMP (Cluster-Aided Mobility Predictor), a cluster-aided predictor whose design is based on recent non-parametric Bayesian statistical tools [7, 8]. CAMP extracts, from the data, clusters of users with similar mobility processes, and exploits this clustered structure to provide accurate mobility predictions. The use of non-parametric statistical tools allows us to adapt the number of extracted clusters to the training data (this number can actually grow with the data, i.e., with the number of users). This makes our algorithm robust: CAMP exploits similarities in users’ mobility only if such similarities are actually present in the training data.

• We derive theoretical performance guarantees for the predictions made under the CAMP algorithm. In particular, we show that CAMP can achieve the performance of an optimal predictor (among the set of all predictors) when the number of users grows large, and for a large class of mobility models.

• Finally, we compare the performance of our predictor to that of other existing predictors using two large-scale mobility datasets (corresponding to a Wi-Fi and a cellular network, respectively). CAMP significantly outperforms existing predictors, and in particular those that only exploit individual past trajectories to estimate users’ next location.

## 2 Related work

Most existing mobility prediction methods estimate the next location of a specific user by inspecting the past individual trajectories of this user. One of the most popular mobility predictors consists in modelling the user trajectory as an order-$k$ Markov chain. Predictors based on the order-$k$ Markov model are asymptotically optimal [9, 6] for a large class of mobility models. This optimality only holds asymptotically, when the length of the observed user past trajectory tends to infinity. Unfortunately, when the observed past trajectory of the user is rather short, these predictors perform poorly. This phenomenon is often referred to as the “cold-start problem”. To improve the performance of these predictors for short histories, a fallback mechanism can be added [4] to reduce the order of the Markov model when the current sequence of previous locations has not been encountered before. Alternatively, one may adapt the order of the Markov model used for prediction as in the Sampled Pattern Matching (SPM) algorithm [6], which sets the order of the Markov model to a fraction of the longest suffix match in the history. SPM is asymptotically optimal with provable bounds on its rate of convergence, when the trajectory is generated by a stationary mixing source. Another type of mobility predictor, NextPlace [5], attempts to leverage the time-stamps that may be associated with the successive locations visited by the user. Empirical evaluations [4, 3] show that complex mobility models do not perform well: the order-2 Markov predictor with fallback gives comparable performance to that of SPM [6], NextPlace [5] and higher order Markov predictors. In addition, [3] reports that the order-1 Markov predictor can actually provide better predictions than higher order Markov predictors, as the latter suffer more from the lack of training data.

There have been a few papers aiming at clustering trajectories or more generally stochastic processes. For example, [10] proposes algorithms to find clusters of trajectories based on likelihood maximization for an underlying hidden Markov model. For the same problem, [11] uses spectral clustering in a semi-parametric manner based on the Bhattacharyya affinity metric between pairs of trajectories. Those methods would not work well in our setting, because they require that (i) users belonging to the same cluster have trajectories generated by identical parameters, and (ii) the number of clusters be known beforehand, or estimated in a reliable way. The non-parametric Bayesian approach developed in this paper addresses both issues. [12] also introduced a Bayesian approach, which focuses on the similarity between users’ temporal patterns. However, it does not consider the similarity between spatial trajectories or the correlation with recent locations, both of which are crucial for correct predictions in our setting.

## 3 Models and Objectives

In this section, we first describe the data on past user trajectories available at a given time to build predictors. We then provide a model for user mobility, used to define our non-parametric inference approach, as well as its objectives.

### 3.1 Collected Data

We consider the problem of predicting at a given time the mobility, i.e., the next position, of users based on observations of past users’ trajectories. These observations are collected and stored on a server. The set of users is denoted by $\mathcal{U}$, and users are all moving within a common finite set $\mathcal{L}$ of locations. The trajectory collected for user $u$ is denoted by $x^u = (x^u_1, \ldots, x^u_{n^u})$, where $x^u_t$ corresponds to the $t$-th location visited by user $u$, and where $n^u$ refers to the length of the trajectory. $x^u_{n^u}$ denotes the current location of user $u$. By definition, we impose $x^u_t \neq x^u_{t+1}$, i.e., two consecutive locations on a trajectory must be different. Let $x^{\mathcal{U}} = (x^u)_{u \in \mathcal{U}}$ denote the set of user trajectories. Observe that the lengths of the trajectories may vary across users. If the location of a user is sensed periodically, we can collect the time a given user has stayed at each location. Those staying times for user $u$ are denoted by $s^u = (s^u_1, \ldots, s^u_{n^u})$, where $s^u_t$ is the staying time at the $t$-th visited location. To simplify the presentation, we present our prediction methods ignoring the staying times $s^u$; but we mention how to extend our approach to include staying times in §4.2.4.

Next we introduce additional notation. We denote by $n^u_{i,j}$ the number of observed transitions for user $u$ from location $i$ to $j$ (i.e., $n^u_{i,j} = \sum_{t=1}^{n^u-1} \mathbb{1}(x^u_t = i,\, x^u_{t+1} = j)$). Similarly, $n^u_i$ is the number of times user $u$ has been observed at location $i$. Let $\mathcal{H}$ denote the set of all possible trajectories of a given user, and let $\mathcal{H}^{\mathcal{U}}$ be the set of all possible sets of trajectories of users in $\mathcal{U}$.
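The counts above can be computed in a single pass over a trajectory. The following is a minimal sketch, assuming a trajectory is a plain Python list of location labels; the function name `transition_counts` is illustrative, and the per-location count is taken here as the number of departures from a location (so that it equals the sum of the outgoing transition counts), which is an assumption.

```python
from collections import Counter

def transition_counts(trajectory):
    """Return (n_ij, n_i) for one user's trajectory.

    n_ij[(i, j)] counts observed transitions i -> j.
    n_i[i] counts departures from i (assumption: the current, last
    location is not counted because no transition out of it is observed).
    """
    n_ij = Counter(zip(trajectory, trajectory[1:]))
    n_i = Counter(trajectory[:-1])
    return n_ij, n_i

traj = ["home", "work", "cafe", "work", "home", "work"]
n_ij, n_i = transition_counts(traj)
print(n_ij[("home", "work")])  # 2
print(n_i["work"])             # 2
```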

### 3.2 Mobility Models

The design of our predictors is based on a simple mobility model. We assume that user trajectories are order-1 Markov chains, with arbitrary initial state or location. More precisely, user-$u$’s trajectory is generated by the transition kernel $\theta^u = (\theta^u_{i,j})_{i,j \in \mathcal{L}} \in \Theta$, where $\theta^u_{i,j}$ denotes the probability that user $u$ moves from location $i$ to $j$ along her trajectory. Hence, given her initial position $x^u_1$, the probability of observing trajectory $x^u$ is $P_{\theta^u}(x^u) = \prod_{t=1}^{n^u-1} \theta^u_{x^u_t, x^u_{t+1}}$. Our mobility model can be readily extended to order-$k$ Markov chains. However, as observed in [3], the order-1 Markov chain model already provides reasonably accurate predictions in practice, and higher-order models would require a fall-back mechanism [4]. Throughout the paper, we use uppercase letters to represent random variables and the corresponding lowercase letters for their realizations, e.g. $X^u$ (resp. $x^u$) denotes the random (resp. realization of) trajectory of user $u$.
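To make the order-1 model concrete, here is a small sketch that evaluates the log-probability of a trajectory under a given transition kernel and returns the most likely next location. The nested-dictionary representation of the kernel and the names `log_likelihood` and `predict_next` are assumptions made for illustration only.

```python
import math

def log_likelihood(trajectory, theta):
    """Log-probability of a trajectory given its first location, under kernel theta."""
    return sum(math.log(theta[i][j]) for i, j in zip(trajectory, trajectory[1:]))

def predict_next(trajectory, theta):
    """Most likely next location from the current (last) location."""
    row = theta[trajectory[-1]]
    return max(row, key=row.get)

# Hypothetical kernel on three locations; no self-transitions, rows sum to 1.
theta = {
    "A": {"B": 0.7, "C": 0.3},
    "B": {"A": 0.5, "C": 0.5},
    "C": {"A": 0.9, "B": 0.1},
}
ll = log_likelihood(["A", "B", "A"], theta)  # log(0.7) + log(0.5)
nxt = predict_next(["A", "B", "A"], theta)   # "B"
```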

### 3.3 Bayesian Framework, Clusters, and Objectives

We adopt a Bayesian framework, and assume that the transition kernels of the various users are drawn independently from the same distribution $\mu \in \mathcal{P}(\Theta)$, referred to as the prior distribution over the set $\Theta$ of all possible transition kernels. This assumption is justified by De Finetti’s theorem (see [13], Theorem 11.10) if the kernels $(\theta^u)_{u \in \mathcal{U}}$ are exchangeable (which is typically the case if users are a priori indistinguishable). In the following, the expectation and probability under $\mu$ are denoted by $E$ and $P$, respectively. To summarize, the trajectories of users are generated using the following hierarchical model: for all $u \in \mathcal{U}$, $\theta^u \sim \mu$, $x^u \mid \theta^u \sim P_{\theta^u}$, and the initial locations $x^u_1$ are arbitrarily fixed.

To provide accurate predictions even if observed trajectories are rather short, we leverage similarities among user mobility patterns. It seems reasonable to think that the trajectories of some users are generated through similar transition kernels. In other words, the prior distribution $\mu$ might exhibit a clustered structure, putting mass around a few typical transition kernels. Our predictors will identify these clusters, and exploit this structure, i.e., to predict the next location of a user $u$, we shall leverage the observed trajectories of all users who belong to user-$u$’s cluster.

For any user $u$, we aim at proposing an accurate predictor $\hat{x}^u$ of her next location, given the observed trajectories $x^{\mathcal{U}}$ of all users. The (Bayesian) accuracy of a predictor $\hat{x}^u$ for user $u$ is defined as the probability, conditional on $x^{\mathcal{U}}$, that her next location is $\hat{x}^u$. Clearly, given $x^{\mathcal{U}}$, the best possible predictor would be:

$$\hat{x}^u \in \arg\max_{j \in \mathcal{L}} E\left[\theta^u_{x^u_{n^u},\,j} \mid x^{\mathcal{U}}\right]. \tag{1}$$

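Computing this optimal predictor, referred to as the Bayesian predictor with prior $\mu$, requires the knowledge of $\mu$. Indeed, since the kernels are drawn independently from $\mu$, Bayes’ rule gives:

$$E\left[\theta^u \mid x^{\mathcal{U}}\right] = E\left[\theta^u \mid x^u\right] = \frac{\int_\theta \theta\, P_\theta(x^u)\, \mu(d\theta)}{\int_\theta P_\theta(x^u)\, \mu(d\theta)}. \tag{2}$$

Since here the prior distribution $\mu$ is unknown, we will first estimate $\mu$ from the data, and then construct our predictor according to (1)-(2).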
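As an illustration of the Bayesian predictor (1)-(2), the sketch below treats the special case of a known prior with finite support, i.e., a handful of candidate kernels with prior weights. This finite-support prior, the kernel values, and the function name `posterior_mean_row` are assumptions chosen to keep the integrals in (2) as plain sums; it also assumes the observed trajectory has positive probability under at least one candidate kernel.

```python
def posterior_mean_row(trajectory, kernels, weights):
    """Posterior mean of the kernel row at the current location.

    kernels: list of candidate kernels (nested dicts), weights: prior masses.
    Implements the ratio of sums corresponding to the ratio of integrals in (2).
    """
    likelihoods = []
    for theta, w in zip(kernels, weights):
        p = w
        for i, j in zip(trajectory, trajectory[1:]):
            p *= theta[i].get(j, 0.0)
        likelihoods.append(p)
    total = sum(likelihoods)  # assumed > 0
    current = trajectory[-1]
    row = {}
    for theta, p in zip(kernels, likelihoods):
        for j, q in theta[current].items():
            row[j] = row.get(j, 0.0) + (p / total) * q
    return row

shared = {"B": {"A": 0.5, "C": 0.5}, "C": {"A": 0.5, "B": 0.5}}
k1 = {"A": {"B": 0.9, "C": 0.1}, **shared}
k2 = {"A": {"B": 0.1, "C": 0.9}, **shared}
row = posterior_mean_row(["A", "B", "A"], [k1, k2], [0.5, 0.5])
prediction = max(row, key=row.get)  # location maximizing (1)
```

With this toy prior, a single observed transition from A to B already shifts most of the posterior mass onto the first kernel, which is the effect the cluster-aided approach aims to exploit.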

## 4 Bayesian Non-parametric Inference

In view of the model described in the previous section, we can devise an accurate mobility predictor if we are able to provide a good approximation of the prior distribution $\mu$ on the transition kernels dictating the mobility of the various users. If $\mu$ concentrates its mass around a few typical kernels that would in turn define clusters of users (i.e., users with similar mobility patterns), we would like to devise an inference method identifying these clusters. On the other hand, our inference method should not discover clusters if there are none, nor require the number of clusters to be specified in advance (as in the traditional mixture modelling approach). Towards these objectives, we apply a Bayesian non-parametric approach that estimates how many clusters are needed to model the observed data and also allows the number of clusters to grow with the size of the data. In Bayesian non-parametric approaches, the complexity of the model (here the number of clusters) is part of the posterior distribution, and is allowed to grow with the data, which confers flexibility and robustness to these approaches. In the remainder of this section, we first present an overview of the Dirichlet Process mixture model, a particular Bayesian non-parametric model, and then apply this model to the design of CAMP (Cluster-Aided Mobility Predictor), a robust and flexible prediction algorithm that efficiently exploits similarities in users’ mobility, if any exist.

### 4.1 Dirichlet Process Mixture Model

When applying Bayesian non-parametric inference techniques [7] to our prediction problem, we add one level of randomness. More precisely, we approximate the prior distribution $\mu$ on the transition kernels by a random distribution with law $g$. This additional level of randomness allows us to introduce some flexibility in the number of clusters present in $\mu$. We shall compute the posterior distribution of $g$ given the observations $x^{\mathcal{U}}$, and hope that this posterior distribution, denoted as $g \mid x^{\mathcal{U}}$, will concentrate its mass around the true prior distribution $\mu$. To evaluate $g \mid x^{\mathcal{U}}$, we use Gibbs sampling techniques (see Section 4.2.1), and from these samples, we shall estimate the true prior $\mu$, and derive our predictor by replacing $\mu$ by its estimate in (1)-(2).

For the higher-level distribution $g$, we use the Dirichlet Process (DP) mixture model, a standard choice of prior over infinite dimensional spaces, such as $\mathcal{P}(\Theta)$. The DP mixture model has a possibly infinite number of mixture components or clusters, and is defined by a concentration parameter $\alpha > 0$, which impacts the number of clusters, and a base distribution $G_0 \in \mathcal{P}(\Theta)$, from which new clusters are drawn. The DP mixture model with parameters $\alpha$ and $G_0$ is denoted by $DP(\alpha, G_0)$ and defined as follows. If $\nu$ is a random measure drawn from $DP(\alpha, G_0)$ (i.e., $\nu \sim DP(\alpha, G_0)$), and $\{A_1, \ldots, A_m\}$ is a (measurable) partition of $\Theta$, then $(\nu(A_1), \ldots, \nu(A_m))$ follows a Dirichlet distribution with parameters $(\alpha G_0(A_1), \ldots, \alpha G_0(A_m))$. It is well known [14] that a sample $\nu$ from $DP(\alpha, G_0)$ has the form $\nu = \sum_{c=1}^{\infty} \beta_c \delta_{\bar{\theta}_c}$, where $\delta_\theta$ is the Dirac measure at point $\theta$, the $\bar{\theta}_c$’s are i.i.d. with distribution $G_0$ and represent the centres of the clusters (indexed by $c$), and the weights $\beta_c$’s are generated using a Beta distribution according to the following stick-breaking construction:

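$$\tilde{\beta}_c \sim \mathrm{Beta}(1, \alpha) \ \text{(the } \tilde{\beta}_c\text{'s are independent)}, \qquad \beta_c = \tilde{\beta}_c \prod_{i=1}^{c-1} (1 - \tilde{\beta}_i).$$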
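The stick-breaking construction can be simulated directly. The sketch below draws the first few weights and tracks the mass left for the remaining (infinitely many) clusters; truncating at a fixed number of sticks and the name `stick_breaking` are choices made for illustration, not part of the construction itself.

```python
import random

def stick_breaking(alpha, num_sticks, rng):
    """Draw the first num_sticks weights beta_c and the leftover stick mass."""
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        fraction = rng.betavariate(1.0, alpha)  # tilde-beta_c ~ Beta(1, alpha)
        weights.append(fraction * remaining)    # beta_c
        remaining *= 1.0 - fraction             # prod_{i<=c} (1 - tilde-beta_i)
    return weights, remaining

rng = random.Random(0)
weights, leftover = stick_breaking(alpha=1.0, num_sticks=20, rng=rng)
```

Smaller values of `alpha` tend to put most of the mass on the first few sticks, whereas larger values spread it over many clusters.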

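When the prior is generated under the above DP mixture model, we can compute the distribution of $\theta^u$ given $\theta^{\mathcal{U} \setminus u}$. When $\theta^{\mathcal{U} \setminus u}$ is fixed, then users in $\mathcal{U} \setminus \{u\}$ are clustered and the set of corresponding clusters is denoted by $c^{\mathcal{U} \setminus \{u\}}$. Users in cluster $c$ share the same transition kernel $\bar{\theta}_c$, and the number of users assigned to cluster $c$ is denoted by $n_{c,-u}$. The distribution of $\theta^u$ given $\theta^{\mathcal{U} \setminus u}$ is then: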

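$$\theta^u \mid \theta^{\mathcal{U} \setminus u} \sim \begin{cases} G_0 & \text{w.p. } \dfrac{\alpha}{\alpha + |\mathcal{U}| - 1}, \\[2mm] \delta_{\bar{\theta}_c} & \text{w.p. } \dfrac{n_{c,-u}}{\alpha + |\mathcal{U}| - 1}, \quad \forall c \in c^{\mathcal{U} \setminus \{u\}}. \end{cases} \tag{3}$$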

(3) makes the cluster structure of the DP mixture model explicit. Indeed, when considering a new user $u$, a new cluster containing user $u$ only is created with probability $\frac{\alpha}{\alpha + |\mathcal{U}| - 1}$, and user $u$ is associated with an existing cluster with probability proportional to the number of users already assigned to this cluster. Refer to [15] for a more detailed description of DP mixture models.
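The probabilities in (3) depend only on the current cluster sizes and on the concentration parameter. A minimal sketch, where `cluster_sizes` holds the sizes of the existing clusters with the considered user excluded and the function name is illustrative:

```python
def crp_probabilities(cluster_sizes, alpha):
    """Prior probabilities from (3): join each existing cluster, or open a new one."""
    total = alpha + sum(cluster_sizes)  # alpha + |U| - 1
    existing = [n / total for n in cluster_sizes]
    new = alpha / total
    return existing, new

# Five users in total: the other four are split into clusters of sizes 3 and 1.
existing, new = crp_probabilities([3, 1], alpha=1.0)
# existing is approximately [0.6, 0.2], new is approximately 0.2
```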

Our prediction method simply consists in approximating $E[\theta^u \mid x^{\mathcal{U}}]$ by the expectation w.r.t. the posterior distribution $g \mid x^{\mathcal{U}}$. In other words, for user $u$, the estimated next position will be:

$$\hat{x}^u \in \arg\max_{j \in \mathcal{L}} E_g\left[\theta^u_{x^u_{n^u},\,j} \mid x^{\mathcal{U}}\right], \tag{4}$$

where $E_g$ denotes the expectation w.r.t. the probability measure induced by $g$. To compute $E_g[\theta^u \mid x^{\mathcal{U}}]$, we rely on Gibbs sampling techniques to generate samples with distribution $g \mid x^{\mathcal{U}}$. The way $g \mid x^{\mathcal{U}}$ concentrates its mass around the true prior $\mu$ will depend on the choice of parameters $G_0$ and $\alpha$, and to improve the accuracy of our predictor, these parameters will be constantly updated when successive samples are produced.

### 4.2 CAMP: Cluster-Aided Mobility Predictor

Next we present CAMP, our mobility prediction algorithm. The objective of this algorithm is to estimate $E_g[\theta^u \mid x^{\mathcal{U}}]$, from which we derive the predictions according to (4). CAMP consists in generating independent samples of the assignment of users to clusters induced by the posterior distribution $g \mid x^{\mathcal{U}}$, and then in providing an estimate $\hat{\theta}^u$ of $E_g[\theta^u \mid x^{\mathcal{U}}]$ from these samples. As mentioned above, the accuracy of this estimate strongly depends on the choice of parameters $G_0$ and $\alpha$ in the DP mixture model, and these parameters will be updated as new samples are generated.

More precisely, the CAMP algorithm consists of two steps. (i) In the first step, we use a Gibbs sampler to generate samples of the assignment of users to clusters under the probability measure induced by $g \mid x^{\mathcal{U}}$, and update the parameters $G_0$ and $\alpha$ of the DP mixture model using these samples (hence we update the prior distribution $g$). We repeat this procedure $K$ times. In the $k$-th iteration, we construct $B$ samples of users’ assignment. The $b$-th assignment sample is referred to as $c^{\mathcal{U},b,k}$ in the CAMP pseudo-code, where $c^{u,b,k}$ is the cluster of user $u$ in that sample. The subroutines providing the assignment samples and updating the parameters of the prior distribution $g$ are described in detail in §4.2.1 and §4.2.2, respectively. At the end of the first step, we have constructed a prior distribution $g$ parametrized by $G_0^K$ and $\alpha^K$ which is adapted to the data, i.e., a distribution that concentrates its mass on the true prior $\mu$. (ii) In the second step, we use the updated prior $g$ to generate one last time $B$ samples of users’ assignment. Using these samples, we compute an estimate $\hat{\theta}^u$ of $E_g[\theta^u \mid x^{\mathcal{U}}]$ for each user $u$, and finally derive the prediction $\hat{x}^u$ of the next position of user $u$. The way we compute $\hat{\theta}^u$ is detailed in §4.2.3.

The CAMP algorithm takes as inputs the data $x^{\mathcal{U}}$, the number $K$ of updates of the prior distribution $g$, the number $B$ of samples generated by the Gibbs sampler in each iteration, and the number of times the users’ assignment is updated when producing a single assignment sample using the Gibbs sampler (under the Gibbs sampler, the assignment is a Markov chain, which we simulate long enough for it to reach the desired distribution). These three parameters should be chosen as large as possible. Of course, increasing these parameters also increases the complexity of the algorithm, and we may wish to select the parameters so as to achieve an appropriate trade-off between accuracy and complexity.

#### Sampling from the DP mixture posterior

We use a Gibbs sampler [16] to generate independent samples of the assignment of users to clusters under the probability measure induced by the posterior $g \mid x^{\mathcal{U}}$, i.e., samples of the assignment $c^{\mathcal{U}}$ with distribution $P_g[c^{\mathcal{U}} \mid x^{\mathcal{U}}]$, where $P_g$ denotes the probability measure induced by $g$. Gibbs sampling is a classical MCMC method to generate samples from a given distribution. It consists in constructing and simulating a Markov chain whose stationary distribution is the desired distribution. In our case, the state of the Markov chain is the assignment $c^{\mathcal{U}}$, and its stationary distribution is $P_g[c^{\mathcal{U}} \mid x^{\mathcal{U}}]$. The Markov chain should be simulated long enough (the number of steps is an input of the algorithm) so that at the end of the simulation, the state of the Markov chain has converged to the steady-state. The pseudo-code of the proposed Gibbs sampler is provided in Algorithm 2, and easily follows from the description of the DP mixture model provided in (3).

To produce a sample of the assignment of users to clusters, we proceed as follows. Initially, we group all users in the same cluster: the number of clusters is set to 1, and the number of users (other than a given user $u$) assigned to this cluster is $|\mathcal{U}| - 1$ (see Algorithm 2). Then the assignment is revised a fixed number of times. In each iteration, each user is considered and assigned either to an existing cluster, or to a newly created cluster. This assignment is made randomly according to the model described in (3). Here $x^{c}$ denotes the data of users in cluster $c$, i.e., $x^{c} = (x^v)_{v \in c}$.
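The procedure above can be sketched as a collapsed Gibbs sweep. This is an illustrative sketch only, not a transcription of Algorithm 2: it assumes the initial uniform base distribution (so that each row of a cluster kernel has a uniform Dirichlet prior and the marginal likelihood has a closed form), represents each user by a `Counter` of transition counts, and uses hypothetical names (`log_marginal`, `gibbs_sweep`).

```python
import math
import random
from collections import Counter

def log_marginal(counts, num_locations):
    """Log marginal likelihood of transition counts, uniform Dirichlet prior per row."""
    row_totals, total = Counter(), 0.0
    for (i, _), n in counts.items():
        row_totals[i] += n
        total += math.lgamma(1 + n)
    for n_i in row_totals.values():
        total += math.lgamma(num_locations) - math.lgamma(num_locations + n_i)
    return total

def gibbs_sweep(user_counts, assignment, alpha, num_locations, rng):
    """Resample the cluster of every user once (in place), following (3)."""
    for u, counts_u in user_counts.items():
        assignment[u] = None  # take u out of its current cluster
        clusters = {}
        for v, c in assignment.items():
            if c is not None:
                clusters.setdefault(c, []).append(v)
        labels, log_w = [], []
        for c, members in clusters.items():
            agg = Counter()
            for v in members:
                agg.update(user_counts[v])
            # predictive likelihood of u's data given the data already in cluster c
            gain = log_marginal(agg + counts_u, num_locations) - log_marginal(agg, num_locations)
            labels.append(c)
            log_w.append(math.log(len(members)) + gain)
        labels.append(max(clusters, default=-1) + 1)  # label of a fresh cluster
        log_w.append(math.log(alpha) + log_marginal(counts_u, num_locations))
        top = max(log_w)
        weights = [math.exp(w - top) for w in log_w]
        assignment[u] = rng.choices(labels, weights=weights)[0]
    return assignment

user_counts = {
    "u1": Counter({("A", "B"): 9, ("B", "A"): 9}),
    "u2": Counter({("A", "B"): 8, ("B", "A"): 8}),
    "u3": Counter({("A", "C"): 9, ("C", "A"): 9}),
    "u4": Counter({("A", "C"): 8, ("C", "A"): 8}),
}
assignment = {u: 0 for u in user_counts}  # all users start in one cluster
rng = random.Random(0)
for _ in range(20):
    gibbs_sweep(user_counts, assignment, alpha=1.0, num_locations=3, rng=rng)
print(assignment)
```

Working in log space and subtracting the largest log-weight before exponentiating avoids numerical underflow when trajectories are long.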

#### Updates of G0 and α

As in any Bayesian inference method, our prediction method could suffer from a bad choice of the parameters $G_0$ and $\alpha$ defining the prior $g$. For example, by choosing a small value for $\alpha$, we tend to get a very small number of clusters, and possibly only one cluster. On the contrary, selecting too large a value for $\alpha$ would result in too many clusters, and in turn, would make our algorithm unable to capture similarities in the mobility patterns of the various users. To circumvent this issue, we update and fit the parameters to the data, as suggested in [8]. In the CAMP algorithm, the initial base distribution $G_0$ is uniform over all transition kernels (over $\Theta$) and $\alpha$ is initially set to 1. Then after each iteration, we exploit the samples of assignments of users to clusters to update these initial parameters, by refining our estimates of $G_0$ and $\alpha$.

Note that (5) simply corresponds to a kernel density estimator based on the cluster samples obtained with the prior distribution parametrized by the current $G_0$ and $\alpha$, whereas (6) corresponds to a maximum likelihood estimate (see [17]), which sets $\alpha$ to the value which is most likely to have resulted in the average number of clusters obtained when sampling from the model with the current parameters.
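One standard way to carry out such a fit relies on the fact that, under a Dirichlet process with $n$ users, the expected number of clusters is $\sum_{i=0}^{n-1} \alpha / (\alpha + i)$, and that the maximum likelihood value of $\alpha$ matches this expectation to the observed number of clusters. The sketch below solves that equation by bisection; it is offered as an assumption about how such an update can be implemented, not as a transcription of (6), and the function names are illustrative.

```python
import math

def expected_clusters(alpha, n):
    """Expected number of clusters among n users under a DP with concentration alpha."""
    return sum(alpha / (alpha + i) for i in range(n))

def fit_alpha(avg_clusters, n, lo=1e-6, hi=1e6, iters=200):
    """Find alpha such that expected_clusters(alpha, n) == avg_clusters.

    Assumes 1 < avg_clusters < n; bisection on a log scale, using the fact
    that expected_clusters is increasing in alpha.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if expected_clusters(mid, n) < avg_clusters:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

target = expected_clusters(2.0, 62)  # average cluster count one would observe
alpha_hat = fit_alpha(target, 62)    # recovers a value close to 2.0
```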

#### Computation of ^θu

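As mentioned earlier, $\hat{\theta}^u$ is an estimator of $E_g[\theta^u \mid x^{\mathcal{U}}]$, where $g$ is parameterized by $G_0^K$ and $\alpha^K$, and is used for our prediction of user-$u$’s mobility. $\hat{\theta}^u$ is just the empirical average of $E_g[\bar{\theta}_c \mid x^c]$ over the clusters $c$ to which user $u$ is associated in the last $B$ samples generated in CAMP, i.e.,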

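$$\begin{align} \hat{\theta}^u &= \frac{1}{B} \sum_{b=1}^{B} E_g\left[\bar{\theta}_{c^{u,b,K}} \mid x^{c^{u,b,K}}\right] \tag{7} \\ &= \frac{1}{B} \sum_{b=1}^{B} \frac{\int_\theta \theta \cdot P_\theta(x^{c^{u,b,K}})\, G_0^K(d\theta)}{\int_\theta P_\theta(x^{c^{u,b,K}})\, G_0^K(d\theta)}. \tag{8} \end{align}$$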

Note that in view of the law of large numbers, when $B$ grows large, $\hat{\theta}^u$ converges to $E_g[\theta^u \mid x^{\mathcal{U}}]$. The predictions for user $u$ are made by first computing an estimated transition kernel $\hat{\theta}^u$ according to (8). We derive an explicit expression of $\hat{\theta}^u$ that does not depend on $G_0^K$, but only on the data $x^{\mathcal{U}}$ and the samples generated in the CAMP algorithm. This expression, given in the following lemma, will be useful to understand to what extent the prediction of user-$u$’s mobility under CAMP leverages observed trajectories of other users.
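To illustrate the averaging over sampled clusters, the sketch below handles the simplest case in which the base distribution is the initial uniform one, so that the posterior mean of a cluster kernel has the closed form $(1 + n^c_{i,j}) / (|\mathcal{L}| + n^c_i)$. The restriction to a uniform base distribution, the input format (one aggregated count table per sample, for the cluster containing the user), and the function names are assumptions made for illustration.

```python
def cluster_posterior_row(cluster_counts, current, locations):
    """Posterior mean of the kernel row at `current`, uniform Dirichlet prior."""
    n_i = sum(cluster_counts.get((current, j), 0) for j in locations)
    return {
        j: (1 + cluster_counts.get((current, j), 0)) / (len(locations) + n_i)
        for j in locations
    }

def averaged_row(sampled_cluster_counts, current, locations):
    """Empirical average over samples, in the spirit of (7)."""
    rows = [cluster_posterior_row(c, current, locations) for c in sampled_cluster_counts]
    return {j: sum(r[j] for r in rows) / len(rows) for j in locations}

locations = ["A", "B", "C"]
samples = [
    {("A", "B"): 3, ("A", "C"): 1},  # user's cluster in sample 1 (aggregated counts)
    {("A", "B"): 1},                 # user's cluster in sample 2
]
row = averaged_row(samples, "A", locations)
prediction = max(row, key=row.get)  # "B"
```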

###### Lemma 1

For any $i, j \in \mathcal{L}$, $\hat{\theta}^u_{i,j}$ is computed by a weighted sum of all users’ empirical transition kernels ($n^v_{i,j} / n^v_i$), i.e.,

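$$\hat{\theta}^u_{i,j} = \eta_i + \sum_{v \in \mathcal{U}} \gamma^v_i \frac{n^v_{i,j}}{n^v_i}, \tag{9}$$

where

$$\eta_i = \sum_{c_1..c_K :\, u \in c_K} \xi_{c_1..c_K}\, \frac{1}{|\mathcal{L}| + \sum_{k=1}^{K} n^{c_k}_i}\, \frac{|\mathcal{U}|}{n^K_{c_K}} \prod_{k=1}^{K} \omega^k_{c_k}, \tag{10}$$

$$\gamma^v_i = \sum_{c_1..c_K :\, u \in c_K} \xi_{c_1..c_K}\, \frac{n^v_i \sum_{k=1}^{K} \mathbb{1}(v \in c_k)}{|\mathcal{L}| + \sum_{k=1}^{K} n^{c_k}_i}\, \frac{|\mathcal{U}|}{n^K_{c_K}} \prod_{k=1}^{K} \omega^k_{c_k}.$$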

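The sum $\sum_{c_1..c_K :\, u \in c_K}$ ranges over all tuples $(c_1, \ldots, c_K)$ in which $c_k$ is a cluster sampled at the $k$-th iteration and $c_K$ contains user $u$. $\xi_{c_1..c_K}$ and $\omega^K_c$ are given by: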

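$$\xi_{c_1..c_K} = \prod_{i \in \mathcal{L}} \frac{\prod_{j \in \mathcal{L}} \Gamma\left(1 + \sum_{k=1..K} n^{c_k}_{i,j}\right)}{\Gamma\left(|\mathcal{L}| + \sum_{k=1..K} n^{c_k}_i\right)},$$

$$\omega^K_c = \frac{n^K_c}{B |\mathcal{U}|} \sum_{c_1..c_{K-1}} \xi_{c_1..c_{K-1},c} \prod_{k=1}^{K-1} \omega^k_{c_k},$$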

where , , and .

Proof. Refer to Appendix.

When the current location $i$ is fixed, the first term $\eta_i$ in the r.h.s. of (9) is constant over all users. The second term can be interpreted as a weighted sum of the empirical transition kernels of all users (i.e., $n^v_{i,j} / n^v_i$). The weight of user $v$ ($\gamma^v_i$ in (9)) quantifies how much we account for user-$v$’s trajectory in the prediction for user $u$ at the current location $i$, and can be seen as a notion of similarity between $u$ and $v$. Indeed, as the number of sampled clusters in which both $u$ and $v$ are involved increases, $\gamma^v_i$ in (9) increases accordingly. Also, if $v$ has a relatively high $n^v_i$ compared to other users (i.e., $v$ has accumulated more observations at the location $i$ than other users), a higher weight is assigned to $v$.

#### Estimating the Staying-times

Next we provide a way of estimating how long user $u$ will stay at her current location $x^u_{n^u}$. We may perform such estimation when the available data include the time users stay at the various locations. Typically, the existing spatio-temporal predictors predict the staying time at the current location by computing the average [5] or a quantile [3] of user $u$’s staying times observed at her previous visits to that location. CAMP, on the other hand, additionally exploits other users’ staying time observations using the weight $\gamma^v_i$. More precisely, the staying time of user $u$ at location $x^u_{n^u}$ (denoted by $\hat{s}^u_{n^u}$) is estimated by

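$$\hat{s}^u_{n^u} = \sum_{v \in \mathcal{U}} z \gamma^v_i\, \frac{1}{n^v_i} \sum_{t :\, x^v_t = i} s^v_t, \quad \text{where } i = x^u_{n^u}. \tag{11}$$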

$z$ in (11) is a normalization constant to make the sum of weights over all users equal to 1. The estimate (11) is a heuristic, since $\gamma^v_i$ is actually obtained by clustering users based on their location trajectories $x^{\mathcal{U}}$, rather than on their staying times. This heuristic estimate nevertheless performs well, as empirically shown in Section 6.2.4.
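The estimate (11) is a weighted average of per-user mean staying times. A minimal sketch, assuming the weights $\gamma^v_i$ have already been computed and that only users with at least one recorded stay at location $i$ contribute (users never observed at $i$ are skipped here, which is an assumption); the function name and input format are illustrative.

```python
def estimate_staying_time(weights, stay_times_at_i):
    """Weighted average of users' mean staying times at location i, as in (11).

    weights: {user: gamma weight at location i}
    stay_times_at_i: {user: list of observed staying times at location i}
    """
    users = [v for v in weights if stay_times_at_i.get(v)]
    z = 1.0 / sum(weights[v] for v in users)  # normalization constant
    return sum(
        z * weights[v] * sum(stay_times_at_i[v]) / len(stay_times_at_i[v])
        for v in users
    )

weights = {"u": 0.6, "v": 0.2}
stays = {"u": [10.0, 20.0], "v": [40.0]}  # e.g. minutes
estimate = estimate_staying_time(weights, stays)
```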

## 5 Consistency of CAMP Predictor

In this section, we analyze to what extent $E_g[\theta^u \mid x^{\mathcal{U}}]$ (that is well approximated, when $B$ is large, by $\hat{\theta}^u$ derived in the CAMP algorithm) is close to $E[\theta^u \mid x^u]$, the expectation under the true prior $\mu$. We are mainly interested in the regime where the user population becomes large, while the number of observations for each user remains bounded. This regime is motivated by the fact that it is often impractical to gather long trajectories for a given user, while the user population available may on the contrary be very large. For the sake of the analysis, we assume that the length of user-$u$’s observed trajectory is a random variable, and that the lengths of trajectories are independent across users. We further assume that the length is upper bounded by $\bar{n}$.

Since the length of trajectories is bounded, we cannot ensure that $|E_g[\theta^u \mid x^{\mathcal{U}}] - E[\theta^u \mid x^u]|$ is arbitrarily small. Indeed, for example if users’ trajectories are of length 2 only, we cannot group users into clusters, and in turn, we can only get a precise estimate of the transition kernels averaged over all users. In particular, we cannot hope to estimate $\theta^u$ for each user $u$. Next we formalize this observation. We denote by $\mathcal{H}_{\bar{n}}$ the set of possible trajectories of length less than $\bar{n}$. With finite-length observed trajectories, there are distributions $\nu \in \mathcal{P}(\Theta)$ that cannot be distinguished from the true prior $\mu$ by just observing users’ trajectories, i.e., these distributions induce the same law on the observed trajectories as $\mu$: $P_\nu = P$ on $\mathcal{H}_{\bar{n}}$ (here $P_\nu$ denotes the probability measure induced under $\nu$, and recall that $P$ is the probability measure induced by $\mu$). We prove that, when the number of observed users grows large, $|E_g[\theta^u \mid x^{\mathcal{U}}] - E[\theta^u \mid x^u]|$ is upper-bounded by the performance provided by a distribution $\nu$ indistinguishable from $\mu$, which expresses the consistency of our inference framework. Before we state our result, we introduce the following two notions:
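
KL $\epsilon$-neighborhood: the Kullback-Leibler $\epsilon$-neighborhood of a distribution $\mu \in \mathcal{P}(\Theta)$ with respect to $\mathcal{H}_{\bar{n}}$ is defined as the following set of distributions:

$$K_{\epsilon,\bar{n}}(\mu) = \{\nu \in \mathcal{P}(\Theta) : KL_{\bar{n}}(\mu, \nu) < \epsilon\},$$

where $KL_{\bar{n}}(\mu, \nu)$ denotes the Kullback-Leibler divergence between the laws induced by $\mu$ and $\nu$ on the trajectories in $\mathcal{H}_{\bar{n}}$.

KL support: The distribution $\mu$ is in the Kullback-Leibler support of a distribution $g$ with respect to $\mathcal{H}_{\bar{n}}$ if $g(K_{\epsilon,\bar{n}}(\mu)) > 0$ for all $\epsilon > 0$.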

###### Theorem 2

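If $\mu \in \mathcal{P}(\Theta)$ is in the KL-support of $g$ with respect to $\mathcal{H}_{\bar{n}}$, then we have, $\mu$-almost surely, for any $i, j \in \mathcal{L}$,

$$\lim_{|\mathcal{U}| \to \infty} \left| E_g[\theta^u_{i,j} \mid X^{\mathcal{U}}] - E[\theta^u_{i,j} \mid X^u] \right| \le \sup_{\substack{\nu \in \mathcal{P}(\Theta) \\ P_\nu = P \text{ on } \mathcal{H}_{\bar{n}}}} \left| E_\nu[\theta^u_{i,j} \mid X^u] - E[\theta^u_{i,j} \mid X^u] \right|. \tag{12}$$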

Proof. Refer to Appendix.

The r.h.s. of (12) captures the performance of an algorithm that would perfectly estimate $E_\nu[\theta^u_{i,j} \mid X^u]$ for the worst distribution $\nu$ which agrees with the true prior $\mu$ on $\mathcal{H}_{\bar{n}}$. Note that in our framework, for the prior $g$, we use a DP mixture $DP(\alpha, G_0)$, with a base measure $G_0$ having full support $\Theta$. Therefore, the KL-support of $g$ is here the whole space $\mathcal{P}(\Theta)$; it thus contains $\mu$.

As far as we are aware, Theorem 2 presents the first performance result on inference algorithms using DP mixture models with indirect observations. By indirect observations, we mean that the kernels $\theta^u$ cannot be observed directly, but are revealed only through the trajectories $x^u$. Most existing analyses [18, 19, 20] do not apply in our setting, as these papers aim at identifying conditions on the Bayesian prior $g$ and on the true distribution $\mu$ under which the Bayesian posterior will converge (either weakly or in norm) to $\mu$ in the limit of large population size. Hence, existing analyses are concerned with direct observations of the kernels $\theta^u$.

## 6 Empirical Evaluation of CAMP

### 6.1 Mobility Traces

We evaluate the performance of the CAMP predictor using two sets of mobility traces, collected on a Wi-Fi network and a cellular network, respectively.

Wi-Fi traces [21]. We use the dataset of [21], where the mobility of 62 users was collected for three months in Wi-Fi networks mainly around a campus in South Korea. The smartphone of each user periodically scans its radio environment and gets a list of MAC addresses of available access points (APs). To map these lists of APs collected over time to a set of locations, we compute the Jaccard index between two lists of APs scanned at different times. If two lists of APs have a Jaccard index higher than 0.5, these two lists are considered to correspond to the same geographical location [21]. From the constructed set of locations, we then construct the trajectories of the various users.
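The Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below applies the 0.5 threshold described above to pairs of AP scans; the function names and the placeholder AP identifiers are illustrative, and treating two empty scans as identical is an assumption.

```python
def jaccard(scan_a, scan_b):
    """Jaccard index between two collections of AP identifiers."""
    a, b = set(scan_a), set(scan_b)
    if not a and not b:
        return 1.0  # assumption: two empty scans are treated as identical
    return len(a & b) / len(a | b)

def same_location(scan_a, scan_b, threshold=0.5):
    """Two scans map to the same location if their Jaccard index exceeds the threshold."""
    return jaccard(scan_a, scan_b) > threshold

scan1 = ["ap-aa", "ap-bb", "ap-cc"]
scan2 = ["ap-bb", "ap-cc", "ap-dd"]
scan3 = ["ap-aa", "ap-bb", "ap-cc", "ap-dd"]
print(jaccard(scan1, scan2), same_location(scan1, scan2))  # 0.5 False
print(jaccard(scan1, scan3), same_location(scan1, scan3))  # 0.75 True
```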

ISP traces [22]. We also use the call detail record (CDR) dataset provided by Orange, where the mobility of 50000 subscribers in Senegal was measured over two weeks. We use the SET2 data [22], where the mobility of a given user is reported as a sequence of base station (BS) ids and time stamps. Each record is obtained only when the user communicates with a base station (e.g., phone call, text message).

In each dataset, we first restrict our attention to a subset $\mathcal{L}$ of frequently visited locations. We select the 116 and 80 most visited locations in the Wi-Fi and ISP datasets, respectively. We then re-construct users’ trajectories by removing locations not in $\mathcal{L}$. For the ISP dataset, we extract 200 users (randomly chosen among users who visited at least 10 of the locations in $\mathcal{L}$). From the re-constructed trajectories, we observe a total number of transitions from one location to another equal to 8194 and 13453 for the Wi-Fi and ISP datasets, respectively.

Users’ similarity. Before actually evaluating the performance of various prediction algorithms, we wish to assess whether users exhibit similar mobility patterns that could in turn be exploited in our predictions. Here, we test the similarity of pairs of users only. More precisely, we wish to know whether the observed trajectory of user $v$ could be aggregated to that of user $u$ to improve the prediction of user-$u$’s mobility. To this aim, we use the concept of mutual prediction [23] as follows.

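We first define the empirical accuracy of an estimator $\hat{\theta}$ of user-$u$’s transition kernel:

$$\overline{AC}_u(\hat{\theta}) = \frac{1}{n^u - 1} \sum_{t=2}^{n^u} \mathbb{1}\left(x^u_t = \arg\max_{j} \hat{\theta}_{x^u_{t-1},\,j}\right) \tag{13}$$

Let $\hat{\theta}^v$ be the maximum likelihood estimator of $\theta^v$ given $x^v$ (i.e., $\hat{\theta}^v_{i,j} = n^v_{i,j} / n^v_i$). Intuitively, user-$v$’s trajectory is useful to predict the mobility of user $u$ if $\hat{\theta}^v$ has a high empirical accuracy for user $u$, i.e., if $\overline{AC}_u(\hat{\theta}^v)$ is high. We hence define the similarity of users $u$ and $v$ as $\overline{AC}_u(\hat{\theta}^v)$. Note that this notion of similarity is not symmetric (in general $\overline{AC}_u(\hat{\theta}^v) \neq \overline{AC}_v(\hat{\theta}^u)$), and it always takes its value between 0 and 1.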
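The similarity measure can be sketched in a few lines: fit the maximum likelihood kernel on one user's trajectory, keep its most likely destination for each location, and count how often it predicts the other user's transitions. The function names are illustrative; ties in the argmax are broken by first occurrence, and a location never visited by the reference user counts as a miss, both of which are assumptions.

```python
from collections import Counter

def mle_argmax(trajectory):
    """For each location, the most frequent next location (argmax of the MLE row)."""
    counts = Counter(zip(trajectory, trajectory[1:]))
    best = {}
    for (i, j), n in counts.items():
        if i not in best or n > best[i][1]:
            best[i] = (j, n)
    return {i: j for i, (j, _) in best.items()}

def empirical_accuracy(trajectory, argmax_next):
    """Fraction of transitions in `trajectory` predicted correctly, as in (13)."""
    hits = sum(argmax_next.get(i) == j for i, j in zip(trajectory, trajectory[1:]))
    return hits / (len(trajectory) - 1)

def similarity(traj_u, traj_v):
    """Accuracy on user u's trajectory of the kernel fitted on user v's trajectory."""
    return empirical_accuracy(traj_u, mle_argmax(traj_v))

traj_u = ["A", "B", "A", "B", "A", "C"]
traj_v = ["A", "B", "A", "B"]
print(similarity(traj_u, traj_v))  # 0.8
print(similarity(traj_v, traj_u))  # 1.0
```

The two printed values differ, which illustrates that the similarity is not symmetric.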

Fig. 1 (a) and (b) present the similarity between 62 users in the Wi-Fi traces and 100 users in the ISP subscriber dataset. To provide meaningful plots, we have ordered users so that pairs of users with high similarity are neighbours (to this aim, we have run the spectral clustering algorithm [11] and re-grouped users in the identified clusters). From these plots, the similarity of users is apparent; however, we also clearly observe that perfect clusters (in which users’ patterns are exactly the same) do not really exist. From the dataset, we observe that 1.65% and 5% of user pairs out of all possible pairs have similarity higher than 0.5 for the Wi-Fi and ISP traces, respectively. We also computed the number of users having at least one user with whom the similarity is higher than 0.5. In the Wi-Fi traces, we found 19 (out of 62) such users, whereas in the ISP traces there are 173 (out of 200) such users. These numbers are high, and justify the design of cluster-aided predictors.

### 6.2 Prediction Accuracy

#### Tested Predictors

We assess the performance of six types of predictors: the order-1 Markov predictor (Markov [4]), the order-2 Markov predictor with fallback (Markov-O(2) [4]), AGG, CAMP and CAMP, AGG. Before describing each predictor, we briefly introduce some notations regarding the training data available at a given time. The time stamp of the arrival at -th location on user-’s trajectory is denoted by , and is the length of user-’s trajectory collected before time (i.e., ). The collection of users’ trajectories available for a prediction at time is denoted by (i.e., where ). The prediction for is denoted by

In order to derive an estimate of the -th location of user , the Markov predictors first estimate based on user- trajectory only, i.e., based on . In contrast, AGG and CAMP algorithms exploit the data available on all users to estimate . The AGG algorithm tries in a very naive way to exploit users’ similarities. It considers that all users have the same transition kernel (as if there were a single cluster only), and thus uses all trajectories (in the same way) to estimate . CAMP (resp. AGG) differs from CAMP (resp. AGG) in that its prediction at time under for user uses other users’ complete trajectories (i.e., ). This corresponds to a case where user starts moving along her trajectory after other users have gathered sufficiently long trajectories. Under all algorithms, the estimated is denoted by ). Finally, Markov-O(2) assumes that users’ trajectories are order-2 Markov chains, and for the locations where the corresponding order-2 transitions are not observed, Markov-O(2) falls back to the Markov predictor. The description of the various predictors is summarized in Table 1.
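To make the baseline predictors concrete, the following Python sketch implements the three simplest ones as described above: Markov (order-1, individual data only), Markov-O(2) with fallback to Markov, and AGG (order-1, all trajectories pooled as if there were a single cluster). It is only an illustrative sketch: the function names are hypothetical, trajectories are assumed to be lists of location labels, and returning `None` when no transition has been observed is our assumption.

```python
from collections import Counter, defaultdict

def order1_counts(trajs):
    """Order-1 transition counts accumulated over a list of trajectories."""
    counts = defaultdict(Counter)
    for traj in trajs:
        for prev, nxt in zip(traj, traj[1:]):
            counts[prev][nxt] += 1
    return counts

def order2_counts(traj):
    """Order-2 transition counts: (previous two locations) -> next location."""
    counts = defaultdict(Counter)
    for a, b, c in zip(traj, traj[1:], traj[2:]):
        counts[(a, b)][c] += 1
    return counts

def argmax_or_none(row):
    """Most frequent next location, or None if nothing was observed."""
    return max(row, key=row.get) if row else None

def predict_markov(traj):
    """Order-1 Markov predictor using the user's own trajectory only."""
    return argmax_or_none(order1_counts([traj]).get(traj[-1]))

def predict_markov_o2(traj):
    """Order-2 Markov predictor, falling back to order-1 on unseen contexts."""
    if len(traj) >= 2:
        row = order2_counts(traj).get((traj[-2], traj[-1]))
        if row:
            return argmax_or_none(row)
    return predict_markov(traj)

def predict_agg(all_trajs, traj):
    """AGG: a single order-1 kernel estimated from all users' trajectories."""
    return argmax_or_none(order1_counts(all_trajs).get(traj[-1]))

own = ["a", "b", "c", "d", "b", "e", "d", "b", "e", "a", "b"]
print(predict_markov(own), predict_markov_o2(own))
```

In this toy trajectory the two Markov predictors disagree: after `b` the user most often moves to `e`, but after the pair `(a, b)` she has only ever moved to `c`.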

The parameters , and for CAMP and CAMP are set to 8, 3 and 30.

#### Results

We assess the performance of the various algorithms using two main types of metrics. The first metric, referred to as the Cumulative Accurate Prediction Ratio (CAPR), is defined as the fraction of accurate predictions for all users up to time $d$:

$$\mathrm{CAPR}_{\mathrm{time}} = \frac{1}{\sum_{u\in\mathcal{U}}\big(n^u(d)-1\big)} \sum_{u\in\mathcal{U}} \sum_{s=2}^{n^u(d)} \mathbb{1}\big(\hat{x}^u_s = x^u_s\big).$$

We also introduce a similar metric that captures the cumulative accuracy of predictions after observing $t$ locations on users’ trajectories:

$$\mathrm{CAPR} = \frac{1}{(t-1)\sum_{u\in\mathcal{U}}\mathbb{1}(n^u\ge t)} \sum_{u\in\mathcal{U}:\, n^u\ge t}\ \sum_{s=2}^{t} \mathbb{1}\big(\hat{x}^u_s = x^u_s\big).$$
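For concreteness, the trajectory-length version of the CAPR can be computed as in the sketch below. The function name `capr` and the data layout are assumptions made for illustration: `truths[u]` is the list of locations visited by user `u`, `preds[u]` is the aligned list of predictions, and lists are 0-indexed so that the paper's index `s = 2, ..., t` corresponds to positions `1, ..., t-1` (position 0 carries no prediction).

```python
def capr(preds, truths, t):
    """Cumulative accurate prediction ratio after observing t locations.

    Only users whose trajectory has length at least t are taken into account.
    Returns None when the ratio is undefined (t < 2 or no eligible user).
    """
    eligible = [u for u in truths if len(truths[u]) >= t]
    if t < 2 or not eligible:
        return None
    hits = sum(
        preds[u][s] == truths[u][s]
        for u in eligible
        for s in range(1, t)  # paper's s = 2..t in 1-indexed notation
    )
    return hits / ((t - 1) * len(eligible))

truths = {"u1": ["a", "b", "a", "b"], "u2": ["a", "c"]}
preds = {"u1": [None, "b", "b", "b"], "u2": [None, "b"]}
print(capr(preds, truths, 2), capr(preds, truths, 4))
```

With `t = 2` both users are eligible and one of the two predictions is correct; with `t = 4` only `u1` remains and two of her three predictions are correct.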

The second type of metric concerns the instantaneous accuracy of the predictions. The Instantaneous Accurate Prediction Ratio (IAPR) after observing $t$ locations on users’ trajectories is defined as follows:

$$\mathrm{IAPR} = \frac{1}{\sum_{u\in\mathcal{U}}\mathbb{1}(n^u\ge t)} \sum_{u\in\mathcal{U},\, n^u\ge t} \frac{n^u_{x^u_{t-1},\,\hat{x}^u_t}}{n^u_{x^u_{t-1}}}.$$

Fig.h(a)-(b) present the CAPR as a function of time for the various algorithms and for the two mobility traces. CAMP outperforms all other algorithms at all times. The improvement over Markov and Markov-O(2) can be as high as 65%. This illustrates the performance gain that can be achieved by exploiting users’ similarities. Note that Markov-O(2) does not outperform Markov, which was also observed in [3]. In the following, we only evaluate the performance of the Markov predictor, and do not report that of its order-2 equivalent.

In Fig.h(c)-(f), we plot the CAPR and IAPR as a function of the length of the observed trajectory. In Fig.h(c) and (d), when the collected trajectory is not sufficient (i.e., ), CAMP and CAMP outperform Markov by 64% and 40%, respectively. Regarding the IAPR in the Wi-Fi traces, Fig.h(e) shows that CAMP and CAMP provide much better predictions than Markov when the trajectory length is less than 140. Once sufficient training data has been collected, they yield comparable IAPR. In Fig.h(f), for the ISP traces, the IAPR under CAMP and Markov become similar sooner, as soon as trajectories are longer than 20.

In Fig.h(g) and (h), we evaluate the CAPR and IAPR averaged only over users having at least one other user with whom the similarity is higher than 0.5 (see §6.1). These users are referred to as Mobility Friendly (MF) users. In Fig.h(g), we observe that for MF users, the gain of CAMP and CAMP becomes really significant: when $t=10$, the CAPR of CAMP and CAMP exceeds that of Markov by 102% and 65%, respectively. Also note that CAMP becomes significantly better than CAMP for MF users. This is explained by the fact that we can predict the mobility of MF users much more accurately if we have a long history of the mobility of the users they are similar to. The performance for MF users in the ISP traces is not presented, because there most users (86%) are already MF users.

#### Exploiting Similarities in CAMP

Recall that, by the weight of the empirical transition kernel of user (i.e., ) in computing in (9), we can quantify to what extent the observed trajectory of user is taken into account in the estimate . When summing over all locations , we get an aggregate indicator capturing how impacts the prediction for user-’s mobility. To understand how many users actually impact the prediction for user in the CAMP, we may look at the cardinality of the set of users whose aggregate indicator exceeds a given threshold: where is a normalization constant to make the sum of aggregate indicators over all users equal to 1. The above set is called the set of -similar users.

In Fig.3, we plot the number of -similar users, averaged over all users , as a function of the length of trajectories (in days ). For CAMP, on the first day, the average numbers are 7 and 110 in the Wi-Fi and ISP traces, respectively, which means that CAMP aggressively uses the trajectories of all users for its prediction. As the length of the trajectories increases, the average size decreases to 1.5 after one month in the Wi-Fi traces and to 2.2 after two weeks in the ISP traces. In other words, as data accumulates, CAMP relies on the trajectories of fewer and fewer users for its prediction. This illustrates the adaptive nature of CAMP, which only exploits similarities among users when this is needed. In the case of CAMP, we observe a faster decrease with time of the average number of -similar users, which means that CAMP tends to utilize other users’ data more selectively, even at the beginning. This explains why CAMP performs better than CAMP in Fig.h.

#### Error of Staying Time Estimation

In our scenario, each time a user arrives at a location, a predictor estimates the staying time at that location from the available data. The Markov predictor [3, 5] (resp. AGG) computes the average of the staying times of this user (resp. of all users) measured at that location so far. CAMP predicts the staying time by evaluating equation (11) with the observed data of all users. The performance metric, measured for each user at each visited location, is the difference between the estimated and actual staying times; we refer to it as the estimation error. We evaluate the estimation error on the Wi-Fi traces only, because staying times cannot be observed precisely in the ISP traces, where a location is recorded not periodically but only when users happen to communicate with base stations.
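The two averaging baselines for the staying time can be sketched as follows. This is an illustrative sketch only: the record layout `(user, location, staying_time)` and the function names are hypothetical, and returning `None` is our way of modelling the case where no past observation is available. The CAMP estimate of equation (11) is not reproduced here.

```python
from statistics import mean

def staying_time_markov(history, user, loc):
    """Average of the user's own past staying times at `loc`.

    `history` is a list of (user, location, staying_time) records observed
    so far. Returns None when the user has never stayed at `loc` before.
    """
    own = [d for (u, l, d) in history if u == user and l == loc]
    return mean(own) if own else None

def staying_time_agg(history, loc):
    """Average of all users' past staying times at `loc` (AGG baseline)."""
    pooled = [d for (_, l, d) in history if l == loc]
    return mean(pooled) if pooled else None

history = [("u1", "cafe", 30.0), ("u2", "cafe", 60.0), ("u2", "lab", 120.0)]
print(staying_time_markov(history, "u1", "lab"))  # no own data for this location
print(staying_time_agg(history, "lab"))           # pooled data is still available
```

The example shows why pooling helps at locations a user visits for the first time: the individual average is undefined there, whereas the pooled average can still be computed from other users' visits.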

Fig.b(a) plots the CDFs of the estimation errors obtained by the tested predictors over all users and locations. CAMP provides lower estimation errors than Markov and AGG: its median error is lower than those of Markov and AGG by 35% and 28%, respectively. For 18% of all instances (marked as “Estimation failure”), Markov could not provide an estimate, because the user had not yet recorded any staying time at the current location. In those cases, however, AGG and CAMP are still able to estimate the staying time by using other users’ observations. In Fig.b(b), we further test the estimation quality of AGG and CAMP when Markov is unavailable due to the lack of individual training data. In that case, 43% of the estimates provided by CAMP have errors below 30 minutes. The median estimation error of CAMP is 13.4% lower than that of AGG, because CAMP utilizes other users’ data selectively.

## 7 Concluding Remarks

In this paper, we have presented a cluster-aided inference method to predict the mobility of users in wireless networks. This method significantly departs from existing prediction techniques, as it exploits similarities in the mobility patterns of the various users to improve the prediction accuracy. The proposed algorithm, CAMP, relies on Bayesian non-parametric estimation tools, and is robust and adaptive in the sense that it exploits users’ mobility similarities only if they really exist. We have shown that our Bayesian prediction framework can asymptotically achieve the performance of an optimal predictor when the user population grows large, and have presented extensive experiments indicating that CAMP outperforms existing prediction algorithms. Note also that CAMP can be implemented without damaging users’ privacy (the data can be anonymized).

Many interesting questions remain about the design of CAMP. In particular, we plan to investigate how to set its parameters to achieve an appropriate trade-off between accuracy and complexity. These parameters could also be modified in an online manner while the algorithm is running, to adapt to the nature of the data. We further plan to apply the techniques developed in this paper to other kinds of mobility; for example, we could investigate how users dynamically browse the web, and use our framework to predict the next visited webpage.

## Appendix

### 7.1 Proof of Lemma 1

Observe that in view of (5), we have:

$$G_0^{k+1}(d\theta) = \sum_{c} \omega_c^k\, P_\theta(x^c)\, G_0^k(d\theta), \tag{14}$$

where the sum is over all possible partitions of the set of users in clusters and the weight is

$$\omega_c^k = \frac{n_c^k}{B\,|\mathcal{U}| \int_\theta P_\theta(x^c)\, G_0^k(d\theta)}, \tag{15}$$

with . Recursively replacing in (14) with and putting = Uniform(), we obtain another expression of as

$$G_0^K(d\theta) = \sum_{c_1,\dots,c_{K-1}} \prod_{k=1}^{K-1} \omega_{c_k}^k\, P_\theta(x^{c_k})\, d\theta, \tag{16}$$

where the sum is where is a set of every cluster sampled at -th iterations, i.e., . We can further obtain the recursive expression of the weights by plugging (16) in (15):

 ωKc=nKcB|U|