Achieving Perfect Location Privacy in Wireless Devices Using Anonymization

Zarrin Montazeri, Amir Houmansadr, and Hossein Pishro-Nik. Z. Montazeri is with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003 USA (e-mail: seyedehzarin@umass.edu). A. Houmansadr is with the College of Information and Computer Sciences, University of Massachusetts, Amherst, MA 01003 USA (e-mail: amir@cs.umass.edu). H. Pishro-Nik is with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003 USA (e-mail: pishro@engin.umass.edu). This work was supported by the National Science Foundation under grants CCF 0844725 and CCF 1421957. Parts of this work were presented at the Annual Conference on Information Science and Systems (CISS 2016) [1] and the International Symposium on Information Theory and Its Applications (ISITA 2016) [2].
Abstract

The popularity of mobile devices and location-based services (LBS) has created great concern regarding the location privacy of their users. Anonymization is a common technique that is often used to protect the location privacy of LBS users. Here, we present an information-theoretic approach to define the notion of perfect location privacy. We show how LBSs should use the anonymization method to ensure that their users can achieve perfect location privacy.

First, we assume that a user's current location is independent of her past locations. Using this i.i.d. model, we show that if the pseudonym of the user is changed before $O\left(n^{\frac{2}{r-1}}\right)$ observations are made by the adversary for that user, then the user has perfect location privacy. Here, $n$ is the number of users in the network and $r$ is the number of all possible locations that users can go to.

Next, we model users' movements using Markov chains to better capture real-world movement patterns. We show that perfect location privacy is achievable for a user if the user's pseudonym is changed before $O\left(n^{\frac{2}{|E|-r}}\right)$ observations are collected by the adversary for that user, where $|E|$ is the number of edges in the user's Markov chain model.

Location Privacy, Mobile Networks, Information Theoretic Privacy, Anonymization, Location Privacy Protecting Mechanism (LPPM), Markov Chains.

I Introduction

Mobile devices offer a wide range of services by recording and processing the geographic locations of their users. Such services, broadly known as location-based services (LBS), include navigation, ride-sharing, dining recommendation, and auto-collision warning applications. While such LBS applications offer a wide range of popular and important services to their users, they pose significant privacy threats because of their access to the location information of these wireless devices. Privacy compromises can also be launched by other types of adversaries, including third-party applications, nearby mobile users, and cellular service providers.

To protect the location privacy of LBS users, various mechanisms have been designed [3, 4, 5], known as Location-Privacy Protection Mechanisms (LPPMs). These mechanisms perturb the information of wireless devices, such as the user's identifier or location coordinates, before it is disclosed to the LBS application. LPPMs can be classified into two classes: those that perturb the user's identity information are known as identity perturbation mechanisms, while those that perturb the location information of the users are known as location perturbation mechanisms. Improving the location privacy of users with these LPPMs usually comes at the price of performance degradation for the underlying LBS, so finding the optimal LPPM with respect to the LBS remains an important concern.

In the proposed framework, we employ the anonymization technique to hide the identity of users over time. First, we assume that each user's current location is independent of her past locations in order to simplify the derivations. Then, we model the users' movements by Markov chains, which is a more realistic setting since it captures the dependencies between locations over time. When dealing with privacy, it is advisable to assume the strongest model for the adversary, so in this paper we assume that the adversary has complete statistical knowledge of the users' movements. We formulate a user's location privacy based on the mutual information between the adversary's anonymized observations and the user's actual location information. We define the notion of perfect location privacy and show that, with a properly designed anonymization method, users can achieve perfect location privacy.

Parts of this work have been previously presented as two conference publications [1, 2]. In this manuscript, we extend our previous work in [1, 2] by offering new results, analysis, and perspective. In particular, we provide a clean-slate proof for Theorem 1 to make sure all parts of the proof are presented in a rigorous way. More specifically, the old proof in [1] was only presented in a summary format; a few lemmas were stated without proof or with just a sketch of a proof. Here, we provide the new proof along with all the required details. We have also revised and extended our discussions; for example, Section IV has been added to the paper to clearly elaborate on the problem setting.

It is worth noting that this paper focuses on the theoretical foundations of the location privacy problem when anonymization is used as an LPPM. Needless to say, when implementing anonymization-based LPPMs in practice, our analysis needs to be adjusted to each scenario's specific threat model, e.g., the number of mobile entities, the capabilities of the adversary (the number and location of observation points, prior knowledge about mobile entities, etc.), and the extent of possible geographic locations.

II Related Work

Location privacy has been an active field of research over the past decade [6, 7, 8, 9, 10, 11, 12, 13, 14, 3]. Studies in this field can be classified into two main classes: 1) Designing effective LPPMs for specific LBS systems and platforms, 2) Deriving theoretical models for location privacy, e.g., deriving metrics to quantify the location privacy.

The designed LPPMs can be classified into two classes: 1) location perturbation LPPMs, and 2) identity perturbation LPPMs. Location perturbation LPPMs aim at obfuscating the location information of the users over time and the geographical domain with methods such as cloaking [6, 10] and adding dummy locations [11, 12]. On the other hand, identity perturbation LPPMs try to obfuscate the user's identity while using an LBS. Common approaches to perturbing the identity of the user are to either exchange users' identifiers [15] or assign random pseudonyms to them, known as the anonymization technique [16, 17]. The former method usually uses some pre-defined regions, called mixed-zones, to exchange users' identifiers within those regions. As users cross such regions, they exchange their identifiers with other users in the same region using an encryption protocol to confuse the adversary [18, 19].

Previous studies have shown that using anonymization alone is not enough to protect users' location privacy in real-world scenarios where users go to unique locations. In particular, Zang et al. demonstrate that an adversary has a significant advantage in identifying users who visit unique locations [20]. Also, Golle and Partridge show that the possibility of user identification based on anonymized location traces is significantly increased when the individual's home and work locations are known [21]. Please note that this does not contradict the analysis and findings of our paper, as we use a different setting. First, our analysis seeks to find the theoretical limits of privacy for situations where the number of users ($n$) goes to infinity, which is not the case in previous studies like [21, 20]. Increasing the number of users reduces an adversary's confidence in distinguishing different users. Second, in our analysis we assume "continuous" density functions for the movements of the users across different locations (e.g., the density function introduced in Section V). Therefore, user distributions do not contain Dirac delta functions representing their unique locations. Note that this is not an unrealistic assumption; in real-world scenarios with users having unique locations, we assume that the location information is pre-processed to satisfy this continuity requirement. Such pre-processing can be performed in two ways: first, by reducing the granularity of locations, e.g., in our analysis we divide a region of interest into a number of grids (i.e., into coarse-grained locations); second, an obfuscation mechanism can be applied to location traces to ensure they satisfy the continuity requirement. Further discussion on implementing such pre-processing is out of the scope of our work and we leave it to future work.
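As an illustration of the first pre-processing option (reducing location granularity), the following sketch maps raw coordinates onto a coarse grid of cells. The bounding box, grid size, and function name are placeholders chosen for this example, not values or code from the paper.

```python
# Minimal sketch of coarse-graining locations into grid cells.
# The bounding box and the 5x5 grid are illustrative assumptions only.

def to_grid_cell(lat, lon, lat_min=42.30, lat_max=42.42,
                 lon_min=-72.60, lon_max=-72.45, cells_per_side=5):
    """Map a (lat, lon) pair to a coarse grid-cell index in {0, ..., cells_per_side**2 - 1}."""
    row = min(int((lat - lat_min) / (lat_max - lat_min) * cells_per_side), cells_per_side - 1)
    col = min(int((lon - lon_min) / (lon_max - lon_min) * cells_per_side), cells_per_side - 1)
    return row * cells_per_side + col

# Two nearby GPS fixes collapse into the same coarse location.
print(to_grid_cell(42.375, -72.520), to_grid_cell(42.376, -72.521))
```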

A related but parallel approach to our study is differential privacy-based mechanisms. Differential privacy is mainly studied in the context of databases containing sensitive information, where the goal is to respond to queries on the aggregate of the information in the database without revealing sensitive information about individual entries. Differential privacy has been extensively studied in the context of location privacy, i.e., to prevent data leakage from location information databases [8, 22, 23, 24, 25, 26]. The goal here is to ensure that the presence of no single user could noticeably change the outcome of the aggregated location information. For instance, Ho et al. [27] proposed a differentially private location pattern mining algorithm using quadtree spatial decomposition. Some location perturbation LPPMs are based on ideas from differential privacy [28, 29, 30, 31, 23]. For instance, Dewri [32] suggests designing obfuscation LPPMs by applying differential perturbations. Alternatively, Andres et al. hide the exact location of each user in a region by adding Laplacian-distributed noise to achieve a desired level of geo-indistinguishability [31]. Note that our approach is entirely parallel to this line of work: our paper tries to establish the theoretical limits on location privacy, independent of the LPPM mechanism being used, while differential privacy-based studies on location privacy try to design specific LPPM mechanisms under very specific application scenarios.

Several works aim at quantifying the location privacy of mobile users. A common approach is K-anonymity [4, 16], in which each user's identity is kept indistinguishable within a group of other users. On the other hand, Shokri et al. [13, 14] define the expected estimation error of the adversary as a metric to evaluate LPPMs. Ma et al. [33] use the uncertainty of the users' location information to quantify the location privacy of the user in vehicular networks. Li et al. [34] define metrics to show the tradeoff between the privacy and the utility of LPPMs.

Wang et al. use a Markov decision process (MDP) to protect the privacy of users in context sensing on smartphones [35]. In their setting, both the adversary's strategy and the user's privacy-preserving mechanism change over time, and the goal is to obtain the optimal policy for the users.

Mutual information has previously been used as a privacy metric in other contexts [36, 37, 38, 39, 40]. In this paper, however, we use mutual information specifically for location privacy, and for this reason a new setting for the privacy problem is provided and discussed. Specifically, we give an information-theoretic definition of location privacy using mutual information, and we show that wireless devices can achieve provable perfect location privacy by using the anonymization method in the suggested way.

In [41], the author studies asymptotically optimal matching of sequences to source distributions. However, there are two key differences between [41] and this paper. First, [41] looks only at optimal matching tests but does not consider any privacy metric (i.e., perfect privacy) as considered in this paper; the major part of our work is to show that the mutual information converges to zero, so we can conclude there is no privacy leak (hence perfect privacy). Second, the setting of [41] is different, as it assumes a fixed distribution on sources (i.e., classical inference), whereas we assume the existence of a general (but possibly unknown) prior distribution for the sources (i.e., a Bayesian setting).

III Framework

III-A Defining Location Privacy

In the proposed framework, we consider a region in which a large number of wireless devices are using an LBS. To support their location privacy, the anonymization technique is being used by the LBS. An outsider adversary is interested in identifying users based on their locations and movements. We consider this adversary to be the strongest adversary that has complete statistical knowledge of the users’ movements based on the previous observations or other resources. The adversary has a model that describes users’ movements as a random process on the corresponding geographic area.

Let $X_u(t)$ be the location of user $u$ at time $t$, and let $n$ be the number of users in our network. The location data of the users can be represented in the form of the following stochastic processes:

The adversary's observations are anonymized versions of the $X_u(t)$'s, produced by the anonymization technique. She is interested in knowing $X_u(t)$ for $t = 1, 2, \ldots, m$, based on her anonymized observations for each of the $n$ users, where the number of observations per user, $m = m(n)$, is a function of $n$. Thus, at time $m$, the data shown in the box has been produced:

The goal of this paper is to find the function $m(n)$ such that perfect privacy is guaranteed. Let $\mathbf{Y}$ be the collection of anonymized observations available to the adversary. That is, $\mathbf{Y}$ is the anonymized version of the data in the box. We define perfect location privacy as follows:

Definition 1.

User $u$ has perfect location privacy at time $t$ if and only if

$\lim_{n \to \infty} I\left(X_u(t); \mathbf{Y}\right) = 0,$

where $I\left(X_u(t); \mathbf{Y}\right)$ shows the mutual information between $X_u(t)$ and $\mathbf{Y}$.

The above definition implies that, over time, the adversary's anonymized observations do not give any information about the user's location. The assumption of a large number of users, $n \to \infty$, is valid for almost all applications that we consider, since the number of users in such applications is usually very large.
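To make this metric concrete, here is a minimal sketch that computes the mutual information $I(X;Y)$ (in bits) between two discrete random variables from their joint distribution; the joint tables below are toy numbers invented for illustration, not data from the paper.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits for a joint probability table joint[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal distribution of X
    py = joint.sum(axis=0, keepdims=True)   # marginal distribution of Y
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

# Toy example: the observation leaks information about the location ...
print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))      # > 0, so there is leakage
# ... versus an independent pair, which corresponds to zero leakage.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # = 0, the perfect-privacy condition
```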

In order to achieve perfect location privacy, we only consider anonymization techniques to confuse the adversary. In particular, the anonymization can be modeled as follows:

We perform a random permutation $\Pi$ on the set of $n$ users, chosen uniformly at random among all $n!$ possible permutations, and then assign the pseudonym $\Pi(u)$ to user $u$.

Throughout the paper, we may use $\Pi_u$ instead of $\Pi(u)$ for simplicity of notation.

For $u = 1, 2, \ldots, n$, let $\mathbf{X}_u$ be the vector that shows user $u$'s locations at times $t = 1, 2, \ldots, m$:

Using the permutation function $\Pi$, the adversary observes a permutation of the users' location vectors, the $\mathbf{X}_u$'s. In other words, the adversary observes

$\mathbf{Y} = \operatorname{Perm}\left(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n\right), \qquad (1)$

where $\operatorname{Perm}(\cdot)$ shows the applied permutation function. Then, the location vector observed under pseudonym $\Pi(u)$ is that of user $u$, i.e., $\mathbf{Y}_{\Pi(u)} = \mathbf{X}_u$.
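The anonymization step can be summarized in a few lines of code: draw a permutation of the user indices uniformly at random and release the location vectors under the permuted pseudonyms. This is only an illustrative sketch of the mechanism described above; the data layout and function name are assumptions of the example.

```python
import random

def anonymize(traces, rng=random):
    """Apply a uniformly random permutation Pi to a list of per-user location vectors.

    traces[u] is the location vector X_u of user u. The returned list is the
    adversary's view Y: the same vectors re-indexed by pseudonym Pi(u)."""
    n = len(traces)
    pi = list(range(n))
    rng.shuffle(pi)                  # Pi is uniform over all n! permutations
    observed = [None] * n
    for u in range(n):
        observed[pi[u]] = traces[u]  # user u appears under pseudonym pi[u]
    return observed                  # the permutation pi itself is never revealed

# Example with 4 users and 5 time steps of made-up two-state locations.
traces = [[u % 2] * 5 for u in range(4)]
print(anonymize(traces))
```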

IV Example

Here we provide a simple example to further elaborate the problem setting. Assume that we have only three users and five locations that users can occupy (Figure 1). Also, let us assume that the adversary can collect $m$ observations per user. Each user creates a path as below:

user path
user1
user2
user3
Fig. 1: An area is divided into five regions that users can occupy.

To anonymize the users, we will assign a pseudonym to each. The pseudonyms are determined by the function $\Pi$, defined by a random permutation on the user set:

For this example, suppose that the permutation function $\Pi$ assigns one of three pseudonyms to each of the three users. The choice of the permutation is the only piece of information that is not available to the adversary. So here, the adversary observes the anonymized users and their paths:

and she wants to find which user (under which pseudonym) actually produced each observed path. Based on the number of observations that the adversary collects for each user, $m$, and also the statistical knowledge of the users' movements, she aims at breaking the anonymization function and de-anonymizing the users. The accuracy of this attack depends on the number of observations that the adversary collects; thus, our main goal in this paper is to find the function $m(n)$ such that the adversary is unsuccessful and the users have perfect location privacy.
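The following small sketch recreates this toy setting: three users move over five regions, the paths are released under shuffled pseudonyms, and without any statistical side information every one of the $3! = 6$ user-to-pseudonym assignments is an equally plausible hypothesis for the adversary. The specific paths and labels are invented for illustration.

```python
import itertools
import random

paths = {                                   # invented example paths over regions r1..r5
    "user1": ["r1", "r2", "r2", "r5"],
    "user2": ["r3", "r3", "r4", "r4"],
    "user3": ["r5", "r1", "r1", "r2"],
}

users = list(paths)
pseudonyms = random.sample(range(1, 4), 3)          # random pseudonym for each user
pi = dict(zip(users, pseudonyms))
observed = {pi[u]: paths[u] for u in users}         # what the adversary actually sees

# Without side information, every assignment of users to pseudonyms is a valid hypothesis.
hypotheses = list(itertools.permutations(users))
print(observed)
print(len(hypotheses), "equally plausible assignments")   # 3! = 6
```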

V i.i.d. Model

V-A Perfect Location Privacy for a Simple Two-State Model

To get better insight into the location privacy problem, here we consider a simple scenario where there are only two states to which users can go, states 0 and 1. At any time $t$, user $u$ has probability $p_u$ of being at state 1, independent of her previous locations and other users' locations. Therefore,

To keep things general, we assume that the $p_u$'s are drawn independently from some continuous density function, $f_P$, on the $(0,1)$ interval. Specifically, there are positive constants $\delta_1$ and $\delta_2$ such that $\delta_1 < f_P(p) < \delta_2$ on the interval. (This condition is not actually necessary for the results and can be relaxed; however, we keep it here to avoid unnecessary technicalities.)

The values of the $p_u$'s are known to the adversary. Thus, the adversary can use this knowledge to potentially identify the users. Note that our results do not depend on the choice of $f_P$, and we do not assume that we know the underlying distribution $f_P$; all we assume here is the existence of such a distribution. The following theorem gives a general condition to guarantee perfect location privacy:

Theorem 1.

For two locations with the above definitions and the adversary's anonymized observation vector $\mathbf{Y}$, if all of the following hold:

  1. $m = cn^{2-\alpha}$, for some positive constants $c$ and $\alpha$;

  2. ;

  3. ;

  4. be known to the adversary;

then, we have

i.e., user 1 has perfect location privacy.

Note that although the theorem is stated for user 1, the symmetry of the problem allows it to be restated for all users. Also note that the theorem is proven for any $\alpha > 0$. Therefore, roughly speaking, the theorem states that if the adversary obtains (asymptotically) fewer than $n^{2}$ observations per user, then all users have location privacy.
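A minimal simulation of this two-state setting is sketched below: each user's $p_u$ is drawn from a continuous density bounded away from 0 and 1 (a uniform density on an interior interval is used purely as a stand-in for $f_P$), and each user's locations are i.i.d. Bernoulli($p_u$). All parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_iid_two_state(n, m, eps=0.1):
    """Draw p_u from a continuous density on [eps, 1 - eps] (a stand-in for f_P),
    then generate m i.i.d. Bernoulli(p_u) locations per user."""
    p = rng.uniform(eps, 1 - eps, size=n)       # users' true probabilities of being at state 1
    X = (rng.random((n, m)) < p[:, None])       # X[u, t] = 1 if user u is at state 1 at time t
    return p, X.astype(int)

p, X = simulate_iid_two_state(n=1000, m=50)
print(p[:3])
print(X[0, :10])
```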

V-B The Intuition Behind The Proof

Here we provide the intuition behind the proof; the rigorous proof of Theorem 1 is given in Appendix A. Let us look from the adversary's perspective. The adversary is observing the anonymized locations of the first user, and she wants to figure out the index of the user that she is observing; in other words, she wants to obtain $\Pi(1)$ from $\mathbf{Y}$. Note that the adversary knows the values of $p_1, p_2, \ldots, p_n$. To obtain the locations of user 1, it suffices that the adversary obtains $\Pi(1)$. This is because $\mathbf{Y}_{\Pi(1)} = \mathbf{X}_1$, so

Since each $X_1(t)$ is a Bernoulli random variable with parameter $p_1$, to do so, the adversary can look at the per-pseudonym averages

In fact, these averages provide sufficient statistics for this problem. Now, intuitively, the adversary is successful in recovering $\Pi(1)$ if two conditions hold:

  1. The average observed for pseudonym $\Pi(1)$ is close to $p_1$.

  2. For all other users $u$, the average observed for pseudonym $\Pi(u)$ is not too close to $p_1$.

Now, note that by the Central Limit Theorem (CLT)

That is, loosely speaking, we can write

Consider an interval $J$ such that $p_1$ falls into that interval and whose length, $\Delta_n$, is chosen to be

where $\alpha$ was defined in the statement of Theorem 1. Note that $\Delta_n$ goes to zero as $n$ becomes large. Also, note that for any user $u$, the probability that $p_u$ is in $J$ is larger than $\delta_1 \Delta_n$. In other words, since there are $n$ users, we can guarantee that a large number of the $p_u$'s fall in $J$, since we have

On the other hand, note that

Note that here, we will have a large number of (approximately) normal random variables whose expected values fall, with high probability, in the interval $J$ (which has vanishing length), and whose standard deviations are much larger than the interval length (but asymptotically equal to each other). Thus, distinguishing between them becomes impossible for the adversary. In other words, the probability that the adversary correctly identifies $\Pi(1)$ goes to zero as $n$ goes to infinity. That is, the adversary will most likely choose an incorrect value for $\Pi(1)$. In this case, since the locations of different users are independent, the adversary will not obtain any useful information by looking at $\mathbf{Y}$. Of course, the above argument is only intuitive; the rigorous proof has to make sure all the limiting conditions work out appropriately, and this is accomplished in Appendix A.
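The following simulation sketch illustrates this intuition numerically: when $m$ is small relative to $n$, many users' empirical averages land within one standard deviation of user 1's true $p_1$, so matching pseudonyms to users by their averages fails with high probability. The parameter choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def confusable_users(n, m, eps=0.1):
    """Count users whose empirical average falls within one CLT standard deviation of p_1."""
    p = rng.uniform(eps, 1 - eps, size=n)        # stand-in for the density f_P
    Y_bar = rng.binomial(m, p) / m               # empirical average of each user's observations
    sigma = np.sqrt(p[0] * (1 - p[0]) / m)       # CLT standard deviation of an average near p_1
    return int(np.sum(np.abs(Y_bar - p[0]) < sigma))

for n in (100, 1000, 10000):
    print(n, confusable_users(n=n, m=50))        # the number of confusable users grows with n
```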

V-C Extension to $r$-States Model

Here, we extend our results to a scenario in which we have $r$ locations. At any time $t$, user $u$ has a fixed probability of being at each location $i$, independent of her previous locations and other users' locations; the vector $\mathbf{p}_u$ collects these $r$ location probabilities of user $u$.

We assume that the $\mathbf{p}_u$'s are drawn independently from some $(r-1)$-dimensional continuous density function on the probability simplex. In particular, define the range of the distribution as

Then, we assume there exist positive constants such that

For example, Figure 2 shows this range for the case where there are three locations, $r = 3$.

Fig. 2: The range of the distribution for the case $r = 3$.
Theorem 2.

For $r$ locations with the above definitions and an adversary with observation vector $\mathbf{Y}$, if all of the following hold:

  1. $m = cn^{\frac{2}{r-1}-\alpha}$, where $c$ and $\alpha$ are positive constants;

  2. ,

  3. is known to the adversary

then, we have

i.e., user 1 has perfect location privacy.

The proof of Theorem 2 is analogous to the proof of Theorem 1. Here, we provide the general intuition; we do not provide the entire rigorous proof, as it is for the most part a repetition of the arguments provided for Theorem 1 in Appendix A.

As you can see in Figure 3, for three locations there exists a set $J$ such that $\mathbf{p}_1$ is in that set and we have

We choose the size of this set to shrink appropriately with $n$. Thus, the average number of users whose probability vector is in $J$ is

so we can guarantee that a large number of users are in the set $J$. This can be shown exactly as in the proof of Theorem 1, using Chebyshev's inequality.

Fig. 3: The vector $\mathbf{p}_1$ lies in the set $J$ within the range of the distribution.

Here, the number of times a user is at each location follows a multinomial distribution, and these numbers are asymptotically jointly Gaussian as $m$ goes to infinity. The standard deviations of these variables grow as $\sqrt{m}$. Moreover, the ratio of the standard deviation to the length of the set is also large:

Again, we have a large number of asymptotically jointly normal random variables that have a much larger standard deviation compared to the differences of their means. Thus, distinguishing between them becomes impossible.

This suggests that it is impossible for the adversary to correctly identify a user based on her observations, even though she knows the probability vectors of all users and their distribution. So, all the users have perfect location privacy. The proof can be made rigorous exactly the same way as the proof of Theorem 1, so we do not repeat the details here.
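A corresponding sketch for the $r$-location case is given below: each user's probability vector is drawn from a continuous density on the simplex (a Dirichlet density is used here only as a convenient stand-in), and the per-user visit counts are multinomial; with many users, the count vectors of different users crowd together. Parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def multinomial_counts(n, m, r):
    """Draw each user's probability vector from a continuous density on the simplex
    (a Dirichlet density is an illustrative stand-in), then sample the visit counts."""
    P = rng.dirichlet(np.ones(r), size=n)                          # P[u] = probability vector of user u
    counts = np.array([rng.multinomial(m, P[u]) for u in range(n)])
    return P, counts

P, counts = multinomial_counts(n=5000, m=50, r=4)
# Users whose count vectors coincide exactly with user 1's are already indistinguishable from it.
matches = int(np.sum(np.all(counts == counts[0], axis=1)))
print(P[0], counts[0], matches)
```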

VI Markov Chain Model

Assume there are $r$ possible locations to which users can go. We use a Markov chain with $r$ states to model the movements of each user. We define $E$, the set of edges in this Markov chain, such that $(i, j)$ is in $E$ if there exists an edge from state $i$ to state $j$ with nonzero transition probability.

We assume that this Markov chain structure gives the movement pattern of each user, and what differentiates users is their transition probabilities. That is, for fixed locations $i$ and $j$, two different users could have two different transition probabilities. For simplicity, let us assume that all users start at the same location (state); this condition is not necessary and can be easily relaxed, but we assume it here for clarity of exposition. We now state and prove the theorem that gives the condition for perfect location privacy for a user in the above setting.
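A small sketch of this movement model follows: all users share one Markov chain graph (a ring with self-loops, chosen here purely for illustration), each user has her own transition probabilities on the edges of that shared graph, and every user starts from the same initial state. None of these concrete choices come from the paper beyond what the text above states.

```python
import numpy as np

rng = np.random.default_rng(3)

def user_transition_matrix(r):
    """One user's transition matrix on a shared graph: from state i, either stay at i
    or move to (i + 1) mod r. The two probabilities per state are user-specific."""
    T = np.zeros((r, r))
    for i in range(r):
        stay = rng.uniform(0.2, 0.8)        # user-specific probability on edge (i, i)
        T[i, i] = stay
        T[i, (i + 1) % r] = 1.0 - stay      # remaining mass on edge (i, (i + 1) mod r)
    return T

def sample_path(T, m, start=0):
    """Generate m observations for a user starting at the common initial state."""
    path, state = [start], start
    for _ in range(m - 1):
        state = rng.choice(len(T), p=T[state])
        path.append(state)
    return path

T1 = user_transition_matrix(r=5)   # one user's chain; other users get their own matrices
print(sample_path(T1, m=10))
```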

Theorem 3.

For an irreducible, aperiodic Markov chain with $r$ states and $|E|$ edges, if $m = cn^{\frac{2}{|E|-r}-\alpha}$, where $c$ and $\alpha$ are positive constants, then

$\lim_{n \to \infty} I\left(X_1(t); \mathbf{Y}\right) = 0, \qquad (2)$

i.e., user 1 has perfect location privacy.

Proof.

Let $M_u(i,j)$ be the number of observed transitions from state $i$ to state $j$ for user $u$. We first show that the $M_u(i,j)$'s provide a sufficient statistic for the adversary when the adversary's goal is to obtain the permutation $\Pi$. To make this statement precise, let us define $\mathbf{M}_u$ as the matrix containing the $M_u(i,j)$'s for user $u$:

Also, let $\mathbf{M}$ be the ordered collection of the $\mathbf{M}_u$'s. Specifically,

The adversary can obtain $\tilde{\mathbf{M}}$, a permuted version of $\mathbf{M}$. In particular, we can write

We now state a lemma that confirms $\tilde{\mathbf{M}}$ is a sufficient statistic for the adversary when the adversary's goal is to recover $\Pi$. Remember that $\mathbf{Y}$ is the collection of anonymized observations of users' locations available to the adversary.

Lemma 1.

Given $\tilde{\mathbf{M}}$, the random matrix $\mathbf{Y}$ and the random permutation $\Pi$ are conditionally independent. That is,

$P\left(\Pi = \pi \mid \tilde{\mathbf{M}}, \mathbf{Y}\right) = P\left(\Pi = \pi \mid \tilde{\mathbf{M}}\right). \qquad (3)$

Lemma 1 is proved in Appendix B.
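In code, the statistic used in this proof is just a table of transition counts computed from each observed path, as in the sketch below. The path format is assumed to be a list of state indices (matching the earlier Markov-model sketch); the function and variable names are not the paper's notation.

```python
import numpy as np

def transition_counts(path, r):
    """M[i, j] = number of observed transitions from state i to state j along the path."""
    M = np.zeros((r, r), dtype=int)
    for a, b in zip(path, path[1:]):
        M[a, b] += 1
    return M

# The adversary reduces each anonymized trace to its count matrix and works only with
# these matrices, since they form a sufficient statistic for recovering the permutation.
example_path = [0, 0, 1, 2, 2, 3, 4, 0, 1, 1]
print(transition_counts(example_path, r=5))
```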

Next, note that since the Markov chain is irreducible and aperiodic, when we are determining the transition probabilities, there are $|E| - r$ degrees of freedom. This is because for each state $i$, the outgoing transition probabilities must satisfy

Thus, the Markov chain of a user is completely determined by $|E| - r$ of its transition probabilities, which we show as

and these transition probabilities are known to the adversary for all users. Note that this choice of free transition probabilities is not unique; nevertheless, as long as we fix a specific choice, we can proceed with the proof. We define the corresponding set of edges as those whose transition probabilities belong to this free set, and we consider the range of acceptable values for them. For example, in Figure 4 we have $r = 3$ and $|E| = 6$, so we have three independent transition probabilities. If we choose them according to the figure, we obtain the following region

Fig. 4: Three-state Markov chain example.

The statistical properties of each user are completely known to the adversary since she knows the Markov chain of each user. The adversary wants to be able to distinguish between users by having $m$ observations per user and also knowing the transition probabilities of all users.

In this model, we assume that the vector of free transition probabilities for each user is drawn independently from a $(|E|-r)$-dimensional continuous density function. As before, we assume there exist positive constants such that

We now claim that the adversary's position in this problem is mathematically equivalent to the i.i.d. model where the number of locations is $r'$ with $r' - 1 = |E| - r$. First, note that since the Markov chain is irreducible and aperiodic, it has a unique stationary distribution, which is equal to the limiting distribution. Next, define the vector consisting of all the transition probabilities of user $u$. In particular, based on the above argument, we can represent this vector in the following way:

where the mapping is given by a non-random matrix. Now, note that the full vector of transition probabilities is a non-random function of the free transition probabilities. In particular, if $M_u(i,j)$ shows the observed number of transitions from state $i$ to state $j$ for user $u$, then we only need to know $M_u(i,j)$ for the edges in the free set, as the rest will be determined by the linear transform above. This implies that the decision problem for the adversary is reduced to the decision problem on the free transition probabilities, and the adversary only needs to look at the $M_u(i,j)$'s for the corresponding edges. Now, this problem has exactly the same structure as the i.i.d. model where the number of locations is $r'$ with $r' - 1 = |E| - r$. In particular, the $M_u(i,j)$'s have multinomial distributions, and the statement of Theorem 3 follows by applying Theorem 2.

Discussion: One limitation of the above formulation is that all users must have the same Markov chain graph, with only different transition probabilities. In reality, there might be users that have different Markov chain graphs; for example, we might have users that never visit a specific region. Nevertheless, we can address this issue in the following way. If we are considering the location privacy of user $u$, we only consider the users that have the same Markov chain graph (but with different transition probabilities); the other users are easily distinguishable from user $u$ anyway. Now, $n$ would be the total number of users in this new set, and again we can apply Theorem 3. If $n$ is not large enough in this case, then we need to use location obfuscation techniques in addition to anonymization to achieve perfect privacy.

VII Conclusion

We presented an information-theoretic definition for perfect location privacy using the mutual information between the users' actual locations and the anonymized observations that the adversary collects. First, we modeled users' movements to be independent of their previous locations. In this model, we have $n$ users and $r$ locations. We proved that if the number of anonymized observations that the adversary collects per user, $m$, is less than $O\left(n^{\frac{2}{r-1}}\right)$, then users will have perfect location privacy. So, if the anonymization method changes pseudonyms of users before $O\left(n^{\frac{2}{r-1}}\right)$ observations are made by the adversary for each user, then the adversary cannot distinguish between users, and they can achieve perfect location privacy. Then, we modeled users' movements using Markov chains so that their current locations affect their next moves. We proved that for such a user, perfect location privacy is achievable if the pseudonym of the user is changed before $O\left(n^{\frac{2}{|E|-r}}\right)$ observations are made by the adversary.

Appendix A Proof of Theorem 1 (Perfect Location Privacy for Two-State Model)

Here, we provide a formal proof for Theorem 1. In the proposed setting, we assume we have an infinite number of potential users indexed by the integers, and at any step we consider a network consisting of $n$ users, i.e., users $1, 2, \ldots, n$. We would like to show perfect location privacy as $n$ goes to infinity. Remember that $X_u(t)$ shows the location of user $u$ at time $t$.

In the two-state model, let us assume we have state 0 and state 1. There is a sequence of probabilities $p_1, p_2, \ldots$ for the users. In particular, for user $u$ we have $X_u(t) \sim \text{Bernoulli}(p_u)$ for times $t = 1, 2, \ldots, m$. Thus, the locations of each user are determined by a Bernoulli($p_u$) process.

When we set $n$ as the number of users, we take $m = m(n)$ to be the number of the adversary's observations per user,

So, we have $m \to \infty$ if and only if $n \to \infty$.

As defined previously, $\mathbf{X}_u$ contains user $u$'s locations at the $m$ observation times, and $\mathbf{X}$ is the collection of the $\mathbf{X}_u$'s for all users,

The permutation function applied to anonymize users is (or simply ). For any set , we define

The adversary, who knows all the $p_u$'s, observes the anonymized users $m$ times each and collects their locations in

where .

Based on the assumptions of Theorem 1, if the following hold:

  1. $m = cn^{2-\alpha}$, where $c$ and $\alpha$ are positive constants;

  2. ,

  3. be known to the adversary,

then we want to show

i.e., user 1 has perfect location privacy and the same applies for all other users.

A-A Proof Procedure

Steps of the proof are as follows:

  1. We show that there exists a sequence of sets with the following properties:

    • if then, as

    • let be any sequence such that then

  2. We show that

  3. Using 2, we conclude

    and in conclusion,

A-B Details of the Proof

We define $W_u$, for $u = 1, 2, \ldots, n$, to be the number of times that user $u$ was at state 1,

Based on the assumptions, we have $W_u \sim \text{Binomial}(m, p_u)$. One benefit of the $W_u$'s is that they provide a sufficient statistic for the adversary when the adversary's goal is to obtain the permutation $\Pi$. To make this statement precise, let us define $\mathbf{W}$ as the vector containing $W_u$ for $u = 1, 2, \ldots, n$:

Note that

Thus, the adversary can obtain $\tilde{\mathbf{W}}$, a permuted version of $\mathbf{W}$, by adding the elements in each column of $\mathbf{Y}$. In particular, we can write

We now state and prove a lemma that confirms $\tilde{\mathbf{W}}$ is a sufficient statistic for the adversary when the adversary's goal is to recover $\Pi$. The usefulness of this lemma will be clear, since we can use the law of total probability to break the adversary's decision problem into two steps: (1) obtaining the posterior probability distribution for $\Pi$, and (2) estimating the locations given the choice of $\Pi$.
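In code, the statistic described here is simply the per-pseudonym column sum of the observation matrix. The sketch below assumes the anonymized observations are arranged as an $m \times n$ 0/1 matrix with one column per pseudonym; this layout is an assumption of the example, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy setup: n users, m observations each, two states (0 and 1).
n, m = 6, 20
p = rng.uniform(0.1, 0.9, size=n)
X = (rng.random((m, n)) < p).astype(int)      # column u holds user u's Bernoulli(p_u) locations
Y = X[:, rng.permutation(n)]                  # adversary's view: columns shuffled by Pi

W_tilde = Y.sum(axis=0)                       # permuted vector of "times at state 1" per pseudonym
print(W_tilde)                                # all the adversary needs from Y, by the sufficiency lemma
```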

Lemma 2.

Given $\tilde{\mathbf{W}}$, the random matrix $\mathbf{Y}$ and the random permutation $\Pi$ are conditionally independent. That is,

$P\left(\Pi = \pi \mid \tilde{\mathbf{W}}, \mathbf{Y}\right) = P\left(\Pi = \pi \mid \tilde{\mathbf{W}}\right). \qquad (4)$
Proof.

Remember

Note that $\mathbf{X}$ (and therefore $\mathbf{Y}$) is an $m$ by $n$ matrix, so we can write

where for , we have

Also, $\mathbf{W}$ is a $1$ by $n$ vector, so we can write

We now show that the two sides of Equation 4 are equal. The right hand side probability can be written as

Now note that

Similarly, we obtain

Thus, we conclude that the right hand side of Equation 4 is equal to

Now let’s look at the left hand side of Equation 4. First, note that in the left hand side probability in Equation