3PS - Online Privacy through Group Identities

Pól Mac Aonghusa and Douglas J. Leith

P. Mac Aonghusa is with IBM Research and Trinity College Dublin. D.J. Leith is with Trinity College Dublin.

Abstract

Limiting online data collection to the minimum required for specific purposes is mandated by modern privacy legislation such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act. This is particularly true in online services where broad collection of personal information represents an obvious concern for privacy. We challenge the view that broad personal data collection is required to provide personalised services. By first developing formal models of privacy and utility, we show how users can obtain personalised content while retaining an ability to plausibly deny their interests in topics they regard as sensitive, using a system of proxy group identities we call 3PS. Through extensive experiments on a prototype implementation, using openly accessible data sources, we show that 3PS provides personalised content to individual users over of the time in our tests, while protecting plausible deniability effectively in the face of worst-case threats from a variety of attack types.

Personal Privacy, Plausible Deniability, Group Identities, Recommender Systems, Web Search.

I Introduction

Gathering and analysing data about user interests and behaviours is arguably the de facto business model for the free-to-use internet. Personalisation to enhance user experience is offered as a general motivation for broad data collection. The numbers are impressive. Facebook earned an average of US$4.65 per user from personalised content such as advertising and promoted posts in the second quarter of 2017, according to the Economist [economist2017]. By comparison, an average of just US$0.08 per user came from direct fees such as payments within virtual games.

In this paper we ask a natural question: is much less personal information than is currently collected sufficient to provide an effective personalised service? Recent legislation, such as the General Data Protection Regulation (GDPR), mandates that personal data must be adequate, relevant and limited to what is necessary in relation to the purposes for which those data are processed [GDPR:2016]. In this respect, broad collection of user data without transparent purpose in online interactions with everyday commercial systems is a particular concern for individual privacy.

We consider users of everyday online commercial systems where personalised content, tuned to user interests, is displayed during interactions. The privacy model considered here is based on plausible deniability of likely interest in topics an individual user regards as sensitive. We show that, by adopting the persona of an appropriate group containing many users, an individual user can gain a good degree of personalisation while successfully limiting personal data disclosure. The use of group identities as a proxy technique provides a natural “hiding in the crowd” form of privacy comparable to techniques such as k-anonymity, so that a user can plausibly deny their interests in topics they deem sensitive. This model is intuitive for users to understand and, importantly, to appreciate its limitations.

Our contributions include a novel proxy agent framework we call 3PS, for Privacy Preserving Proxy Service, where a user may protect their interests in sensitive topics from unwanted personalisation by submitting queries through a pool of group identities called Proxy Agents. We also formalise notions of personalisation utility and privacy detection and test these experimentally using openly available data-sets. We show that user privacy need not come at the cost of reduced utility in personalised services when aggregated group information represented by the proxy agent pool is sufficient for personalisation.

The 3PS framework is designed to be simple to deploy with minimal technical disruption to existing systems. We provide a privacy preserving algorithm for selecting group membership of proxy agents. By running the selection algorithm locally, users can find the group identity best matching their interests without revealing their interests. Through extensive experimental verification we show that our method of selecting group membership is both accurate - selecting the group identifier closest in topical interests with average accuracy across all experiments - and converges rapidly within input–output iterations on average.

Personal privacy is fundamentally a risk management exercise where there is an ongoing responsibility on users to take reasonable care. There are no absolute guarantees and individuals must strike their own balance between privacy and utility. Our results suggest that using group identities such as 3PS can provide effective and verifiable privacy protection for responsible users without overly degrading the personalisation capability of the underlying backend system.

II Privacy and Personalisation Models

II-A General Setup

We consider a setup where users interact with a system by submitting an input and receiving an output in response. Each interaction between a user and the system consists of an input–output pair, referred to as an input–output interaction or step. We assume that user inputs and system outputs are each decomposable into features. For example, when modelling a user querying movies or hotels the input features might consist of keywords, or if assigning ratings the features might be clicks. An ordered list of features with no duplicate entries is called a dictionary. We let $D_X$ and $D_Y$ denote the dictionaries containing the valid input features to the system and the valid output features generated by the system respectively. Individual features are indicated thus, $\theta^X_i$ and $\theta^Y_j$, so that $\theta^X_i$ denotes the $i$-th feature in $D_X$ and $\theta^Y_j$ the $j$-th feature in $D_Y$. Valid inputs and outputs are combinations of features from $D_X$ and $D_Y$ respectively, and a valid input–output interaction pairs a valid input with a valid output.

We gather a sequence of consecutive input–output interactions between a user and the system into a session. Input–output interactions may repeat during a session and so sessions are represented as sequences of input–output interactions. Where the meaning as applied to sequences is clear, we use set notation to improve readability. We denote the overall sequence of input–output interactions generated by users up to step $k$ by $Z_k$, which includes the background knowledge available before the first interaction is observed. The subsequence of input–output interactions associated with user $u$ up to step $k$ is denoted by $Z_{u,k}$, which likewise includes the background knowledge available about $u$ before the first interaction is observed.

Each user has a private labelling function which associates input–output interactions in with topic labels selected from a private, user-defined set of labels . We adopt the convention that the label is identified with a catch-all “non-sensitive” category while the remaining elements in label individual “sensitive” topics such as “health” or “finances”. The user labelling function is private and labels every input–output pair in with at least one topic from . Mostly we are simply interested in whether an input-output pair is sensitive or not for a user, in that case we define to be the indicator function with when input-output pair , , i.e. is labelled with a sensitive topic by user , and otherwise.

Let $Z^{u,c}_{u,k}$ denote the subsequence of observations originating from user $u$ that are labelled with topic $c$; taken over all sensitive topics, these form the subsequence of observations in $Z_{u,k}$ that user $u$ has labelled as sensitive. We let $Z^{u,c}_{k}$ denote the subsequence of $Z_k$ that $u$ would label with topic $c$; taken over all sensitive topics, these form the subsequence of $Z_k$ that would be labelled as sensitive by user $u$. The sequence $Z^{u,c}_{k}$ contains items from users other than $u$. Consequently, while $Z^{u,c}_{u,k} \subseteq Z^{u,c}_{k}$, it is not generally the case that $Z^{u,c}_{k}$ is a subsequence of $Z_{u,k}$.

We assume that user labelling functions are well-behaved in the following sense:

Assumption (Meaningful Labelling)

An input-output pair which is labelled as non-sensitive by a user is truly non-sensitive for that user e.g. the user would be content for it to be shared publicly.

Assumption II-A requires users to strike their own balance between utility and privacy. The low risk strategy of simply labelling every input-output pair as sensitive implies that the user may not be able to use the system at all. For example, if the system is a dating service, the knowledge that a person uses the system necessarily reveals their interest in such a service. A user choosing to use the system cannot include such system-level topics in their sensitive set. The implicit statement in Assumption II-A is that users form an individual judgement regarding the inference capabilities of observers and accept a degree of risk that this judgement call proves incorrect.

II-B Privacy and Threat Model

Our interest is in privacy attacks where an attacker seeks to infer topics of likely interest to users of online systems. An attacker is successful when users are unable to deny their interest in a topic on the balance of probabilities. Here attackers have access to input–output interactions $Z_{att,k}$. By analysing $Z_{att,k}$ the attacker attempts to estimate topics that are of likely interest to a user $u$. The privacy model here is plausible deniability, allowing users to reasonably deny that observations are solely associated with topics they deem sensitive. We formalise plausible deniability in our context as follows:

Definition (δ-Plausible Deniability)

A user $u$ can plausibly deny that their input–output observations are associated with topics they deem sensitive if

$$\mathrm{P}(z \in Z^{u,c}_{k} \mid z \in Z_{att,k}) \le \delta \tag{1}$$

where the deniability parameter, $\delta$, is chosen by $u$ and $Z_{att,k}$ is the background knowledge of an attacker at step $k$ of a session.

This differs from the ()–Plausible Deniability model introduced in [mac.P2] where an individual user claimed plausible deniability because an input–output observation from that user could be associated with any of several topics.

Observe that

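$$\begin{aligned}
\mathrm{P}(z \in Z^{u,c}_{k} \mid z \in Z_{att,k}) &\overset{(a)}{\le} \frac{\mathrm{P}(z \in Z^{u,c}_{k} \cap Z_{k})}{\mathrm{P}(z \in Z_{k})} \cdot \frac{\mathrm{P}(z \in Z_{k})}{\mathrm{P}(z \in Z_{att,k})} && (2) \\
&\overset{(b)}{=} \frac{\mathrm{P}(z \in Z^{u,c}_{k} \mid z \in Z_{k})}{\mathrm{P}(z \in Z_{att,k} \mid z \in Z_{k})} && (3)
\end{aligned}$$

where inequality (a) follows because $Z_{att,k} \subseteq Z_k$, so that $\mathrm{P}(z \in Z^{u,c}_{k} \cap Z_{att,k}) \le \mathrm{P}(z \in Z^{u,c}_{k} \cap Z_{k})$, and equality (b) follows since $\mathrm{P}(z \in Z_{att,k} \cap Z_{k}) = \mathrm{P}(z \in Z_{att,k})$. Hence, for $\delta$-plausible deniability to hold it is sufficient that

$$\mathrm{P}(z \in Z^{u,c}_{k} \mid z \in Z_{k}) \le \delta \, \mathrm{P}(z \in Z_{att,k} \mid z \in Z_{k}) \tag{4}$$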

From (4), when an observer has access to all of the observations in the system so that and then it is sufficient to have for -plausible deniability to hold. In the case that the observer is able to make observations at a more local level, so that , then (4) implies that is required for -plausible deniability to hold. Consequently, unless the user can plausibly deny that they contributed to , we have

Observation (Power of Observers)

Observers represent more powerful threats when they have access to more localised sequences of input–output interactions so there is some trade-off involved in locality versus deniability.

II-C Comparison with Other Privacy Models

In the group identity setup considered here, the intention is to deny interest by hiding sensitive user activity in the overall activity of users of shared group identifiers. The setup here can be compared with other privacy models. We show briefly how this is done in the cases of two common models of privacy, Differential Privacy, [dwork2006differential], and Individual Re-identification, [sweeney2000simple].

II-C1 Re-identification

Re-identification risk occurs when an attacker, possessing observations , can assert that sensitive input–output interactions generated by user are identified with probability greater than for . In other words, when

$$\mathrm{P}(z \in Z^{u,c}_{k} \cap Z_{u,k} \mid z \in Z_{att,k}) > 1 - \epsilon \tag{5}$$

for .

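If $\delta$-plausible deniability holds, (1) guarantees

$$\mathrm{P}(z \in Z^{u,c}_{k} \cap Z_{u,k} \mid z \in Z_{att,k}) \le \delta \tag{6}$$

since $Z^{u,c}_{k} \cap Z_{u,k} \subseteq Z^{u,c}_{k}$. Consequently (1) prevents re-identification of those sensitive input–output interactions with probability at least $1 - \delta$.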

II-C2 Differential Privacy

Recall that a query mechanism $M$ satisfies $(\epsilon, \gamma)$-differential privacy [dwork2006differential] if, for any two sequences $D_1$ and $D_2$ of the same length differing in one element, and any set of output values $S$, we have

$$\mathrm{P}(M(D_1) \in S) \le e^{\epsilon} \, \mathrm{P}(M(D_2) \in S) + \gamma \tag{7}$$

One important class of mechanisms are those where sequences in are first perturbed, e.g. by adding noise, and then queries are answered. It is this approach which is effectively adopted here, with the perturbations being introduced by the randomness of the process generating the input–output interactions. An attacker observes a sequence of input–output interactions and seeks to associate a label with one or more input–output interactions, namely whether or not they were likely to be generated by a target user and are sensitive for that user. Consider therefore the query i.e. which labels input-output pair as when it is sensitive for user and labels it otherwise. This is a worst case query in the sense that it assumes the attacker knows the labelling function , and when this is not the case the labelling accuracy will obviously be degraded. Let be two input-output sequences such that , where denotes the ’th element of sequence and similarly for i.e. sequences and are identical except for the ’th element. Mechanism is -differentially private provided

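$$\begin{aligned}
p_1 \le e^{\epsilon} p_2 + \gamma, &\qquad p_2 \le e^{\epsilon} p_1 + \gamma && (8) \\
1 - p_1 \le e^{\epsilon} (1 - p_2) + \gamma, &\qquad 1 - p_2 \le e^{\epsilon} (1 - p_1) + \gamma && (9)
\end{aligned}$$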

where

$$p_1 := \mathrm{P}(l_u(D_1(j)) = 1), \qquad p_2 := \mathrm{P}(l_u(D_2(j)) = 1) \tag{10}$$

are the probabilities that input-output pair in sequence , respectively , is labelled sensitive by user . For sequences satisfying the -plausible deniability condition (1) we have and . It can be verified that the -differential privacy conditions (8)-(9) are therefore satisfied for and .
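Conditions (8)-(9) are easy to check numerically for a binary labelling query. The following is a minimal sketch, not part of the prototype: the function name `satisfies_dp`, the value of `delta` and the grid of test probabilities are illustrative assumptions. It also assumes that plausible deniability bounds both probabilities by $\delta$, and it tests one candidate parameter choice ($\epsilon = 0$, $\gamma = \delta$), which is an assumption of this sketch rather than a value taken from the text.

```python
import math

def satisfies_dp(p1, p2, eps, gamma, tol=1e-12):
    """Return True when p1, p2 satisfy all four inequalities in (8)-(9)."""
    bound = math.exp(eps)
    return (
        p1 <= bound * p2 + gamma + tol
        and p2 <= bound * p1 + gamma + tol
        and (1 - p1) <= bound * (1 - p2) + gamma + tol
        and (1 - p2) <= bound * (1 - p1) + gamma + tol
    )

delta = 0.2  # hypothetical deniability parameter
# Candidate values for p1 and p2, assumed here to lie in [0, delta].
grid = [i * delta / 20 for i in range(21)]
all_ok = all(
    satisfies_dp(p1, p2, eps=0.0, gamma=delta) for p1 in grid for p2 in grid
)
print(all_ok)
```

Under these assumptions every pair on the grid passes, because two probabilities that both lie in $[0, \delta]$ can differ by at most $\delta$.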

The privacy model described here is concerned with attacks at the application layer that seek to link input–output interactions and associated topics to individual user interests. Linking attacks targeting other vectors are also possible.

One vector for attack is for the service provider to attempt to place cookies or third-party tracking content on the web pages viewed by a user. Within the EU, the GDPR rules require that users be explicitly informed of such actions and must take a positive step to opt in. Hence attempts at such tracking seem like a relatively minor concern. Outside the EU, existing tools for blocking third-party trackers can be used, leaving the setting of unique identifying first party cookies as the main concern. This can be mitigated by standard approaches e.g. by activists maintaining lists of cookies that can be safely used (similar to existing lists of malware sites, trackers and so on) and users blocking the rest.

Another possible vector of attack is to record the IP address of the user browser, and thereby try to link the ratings back to the individual user. However, due to the widespread use of techniques such as VPN or NAT, use of IP addresses as identifiers is unreliable. Users also have the option of using tools such as TOR to further conceal the link between the IP address revealed to the server and the user's identity. Such tools are the subject of an extensive literature in their own right and are complementary to the present discussion.

The parties here are sometimes referred to as observers, rather than attackers, since the relationships are not fundamentally adversarial but rather of the honest but curious variety. Since our main interest is in honest but curious attackers, we exclude active attacks against the UI and user devices from consideration; these are, of course, the subject of an extensive literature in their own right.

III The 3PS Architecture

The challenge is to construct an online system which satisfies Definition II-B, thereby providing $\delta$-plausible deniability to users, while also providing an effective personalised service. We propose an architecture, which we refer to as 3PS, whereby users access the system through a pool of group identities referred to as proxy agents. This is illustrated schematically in Figure 1. The 3PS architecture therefore consists of three interacting parties denoted as follows:

• An online system. The system is a black box in the sense that only its inputs and outputs are observable, while details of its internal workings are hidden from users.

• A pool $P$ of Proxy Agents acting as Group Identities, routing queries to, and output responses from, the system. In effect each group identity is an account used to access the system, with this account being shared by multiple users.

• A pool of users who can submit input to, and receive corresponding output responses from, the system via the group identities provided by the proxy agents in $P$.

In the 3PS architecture the proxy agent pool is controlled by the backend service. One key reason for doing this is to ensure that proxy agent IDs are recognised as genuine users by the backend system. If not recognised as bona fide users the proxy agents may be flagged as a bot or robot and so trigger defences, such as “captchas”, or even be blocked. Other than acknowledging the proxy agents as legitimate users, the 3PS system is intended to be backwards compatible and does not require significant engineering changes in the backend system.

III-A Providing Personalisation

The backend system is assumed to generate recommendations for a proxy agent based on profiling interests in topics as it would for any other user. In a shared proxy setup users inherit the shared profile of the proxy agent they choose. A user accessing the system via the pool of proxy agents and wishing to obtain good recommendations should therefore choose the proxy agent whose interests most closely match their own. As an example, Figure (a) and Figure (b) show the results of issuing the query “cheap flights” through two different proxy agent setups. The choice of query is deliberately intended to trigger commercial advertising for illustrative purposes. In Figure (a) the proxy agent is dedicated to Google Search users located in a single country, Ireland. In Figure (b) the proxy agent is a web-proxy gateway shared by Google Search users from many countries.

The response via the proxy agent in Figure (a) contains significantly more content than the response via the proxy agent in Figure (b). Content in Figure (a) is also more localised to the region of the user, as illustrated by the Google flight search box outlined in red on the figure and by the Ireland “.ie” domains on other results. Content obtained from the shared proxy agent in Figure (b) by contrast reflects the regional settings of the proxy agent rather than the user; in this case, UK currency and websites appear in the adverts.

To obtain personalised content, each user chooses a proxy agent closest to their interests in the sense that it is a solution to

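$$\min_{p \in P} \sum_{c \in C} \left| \mathrm{P}(z \in Z^{u,c}_{u,k} \mid z \in Z_{u,k}) - \mathrm{P}(z \in Z^{u,c}_{u,k} \mid z \in Z_{p,k}) \right| \quad \text{s.t.} \quad \mathrm{P}(z \in Z^{u,c}_{k} \mid z \in Z_{p,k}) \le \delta \tag{11}$$

where $Z_{p,k}$ denotes the input–output interactions of all users with proxy $p$. The constraint in (11) ensures that $\delta$-plausible deniability holds for an observer with access to $Z_{p,k}$.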
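The selection rule (11) can be illustrated with a small sketch. This is a hedged illustration under stated assumptions, not the prototype's implementation: the function name `select_proxy`, the dictionary-based data layout, and the example numbers are all hypothetical. Each proxy is assumed to come with a topic distribution and a single number, `sensitive_mass`, standing in for the left-hand side of the constraint in (11).

```python
def select_proxy(user_dist, proxy_dists, sensitive_mass, delta):
    """Return (proxy, cost) minimising the L1 distance in (11) over feasible proxies.

    user_dist:      {topic: probability} estimated from the user's own history
    proxy_dists:    {proxy: {topic: probability}} as seen through each proxy
    sensitive_mass: {proxy: probability} used to test the constraint in (11)
    """
    best, best_cost = None, float("inf")
    for proxy, dist in proxy_dists.items():
        if sensitive_mass[proxy] > delta:
            continue  # constraint violated: deniability would not hold
        cost = sum(abs(user_dist[c] - dist.get(c, 0.0)) for c in user_dist)
        if cost < best_cost:
            best, best_cost = proxy, cost
    return best, best_cost

# Hypothetical example with one non-sensitive and one sensitive topic.
user = {"other": 0.6, "health": 0.4}
proxies = {
    "p1": {"other": 0.5, "health": 0.5},
    "p2": {"other": 0.9, "health": 0.1},
    "p3": {"other": 0.6, "health": 0.4},
}
mass = {"p1": 0.05, "p2": 0.02, "p3": 0.5}
choice, cost = select_proxy(user, proxies, mass, delta=0.1)
print(choice, round(cost, 3))
```

In this example `p3` matches the user's interests exactly but is excluded by the constraint, so the closest feasible proxy is returned instead. When no proxy is feasible the function returns `None`.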

III-B Threat Models

By varying the observations, $Z_{att,k}$, available to an observer it is possible to model classes of attack encompassing the system itself and observers with access to more localised background knowledge. We introduce two observer classes that we use in the remainder of this paper.

III-B1 Privacy Against A Global Observer

A global observer denotes an attacker where $Z_{att,k} = Z_k$, that is, an attacker with access to all of the input–output interactions for the entire system up to the present step $k$. A global observer does not have knowledge of the user labelling function $l_u$ but can try to cluster the observed input–output interactions to infer topics of likely interest. This class of attacker encompasses the system itself, external parties such as advertising partners, and attackers obtaining data by hacking the system. Provided (1) holds for $Z_{att,k} = Z_k$ then a user has $\delta$-plausible deniability against global observers.

III-B2 Privacy Against A Proxy Observer

We also consider a proxy observer, namely a global observer who also has knowledge of the set of proxy agents $P_u$ used by user $u$. Hence, a proxy observer knows that the input–output interactions generated by user $u$ are contained in the subsequence

$$Z_{att,k} = \left( z \in Z_k : \iota_p(z) = 1,\ p \in P_u \right) \tag{12}$$

where the indicator function $\iota_p(z)$ equals $1$ for input–output interactions submitted via proxy $p$ and $0$ otherwise. From Observation II-B, a proxy observer is a more powerful attacker than a global observer by having access to more localised data. Provided (1) holds with $Z_{att,k}$ given by (12) then a user has $\delta$-plausible deniability against proxy observers.

III-C Mitigating Sybil Attacks

Attacks by dishonest users who submit false inputs in an attempt to manipulate the outputs of the system are outside the scope of the present paper. Although this is an important challenge for all online systems it is not specific to 3PS. That said, the use of shared proxies and unlink-ability of input–output interactions to individual users does potentially facilitate Sybil attacks and so we briefly describe one mechanism, based on the work of [chaum1988untraceable], where such attacks can be disrupted while being compatible with the 3PS setup. In summary, each user mints a number of session tokens (with associated serial number), blinds them with a secret blinding factor and forwards them to the 3PS system through a non-secure channel. The number of tokens available to a user is limited e.g. by requiring users to authenticate or make payment to the service in order to forward a token, or perhaps by limiting the number of tokens allowed within a certain time window. Note that during this phase the user might be identified to the system, e.g. to make a payment. The system then signs the tokens with its private key, without knowledge of the serial number associated with the tokens. On receiving the signed tokens back from the system, the user can remove the blinding factor and use the tokens to submit inputs to the system anonymously. Double use of tokens is prevented by the system maintaining a database of the serial numbers of all tokens that have been issued.
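The token flow described above can be sketched with textbook RSA blind signatures. This is a toy illustration only: the text cites [chaum1988untraceable] but does not fix a concrete scheme, so the RSA instantiation, the tiny key, the function names and the choice to record serial numbers when a token is presented are all assumptions of this sketch. The key size is far too small for real use.

```python
import secrets
from math import gcd

# Toy RSA key held by the system (illustrative values, not secure).
P, Q = 61, 53
N = P * Q                            # public modulus
E = 17                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (Python 3.8+)

def blind(serial, n=N, e=E):
    """User side: hide the serial number with a random blinding factor r."""
    while True:
        r = secrets.randbelow(n - 2) + 2
        if gcd(r, n) == 1:
            break
    return (serial * pow(r, e, n)) % n, r

def sign_blinded(blinded, n=N, d=D):
    """System side: sign without learning the serial number."""
    return pow(blinded, d, n)

def unblind(signed_blinded, r, n=N):
    """User side: remove the blinding factor to obtain a signature on the serial."""
    return (signed_blinded * pow(r, -1, n)) % n

spent = set()  # serial numbers of tokens already presented

def redeem(serial, signature, n=N, e=E):
    """System side: accept a token once if its signature verifies."""
    if pow(signature, e, n) != serial % n or serial in spent:
        return False
    spent.add(serial)
    return True

serial = 1234  # hypothetical token serial number, must be smaller than N
blinded, r = blind(serial)
token_sig = unblind(sign_blinded(blinded), r)
print(redeem(serial, token_sig))  # first presentation
print(redeem(serial, token_sig))  # replay of the same token
```

Because the system only ever sees the blinded value at signing time, it cannot link the token presented later to the authenticated request that produced it, while the `spent` set still blocks reuse.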

IV Prototype Implementation

In this section we describe an experimental implementation of a backend recommender system accepting text queries as inputs and producing text-based outputs. It is not intended to be a fully working system but rather a proof of concept implemented as software that is sufficient to demonstrate the feasibility of 3PS and to illustrate how personalisation and privacy verification might be implemented. In the prototype implementation the internal state of simulated users, proxy agents and the backend system can be inspected for measurement during test. This allows us to conveniently compare probability estimators during experiments that would be private in a production system.

IV-A Personalisation

In the prototype implementation inputs and outputs are sequences of words and the dictionaries, $D_X$ and $D_Y$, consist of common keywords appearing in the input and output respectively. We adopt a standard bag–of–words language model [manning1999foundations] where features in an input–output pair are modelled as being drawn independently, with replacement, and ignoring order, according to the mixture model

$$\mathrm{P}(z \in A_k \mid z \in B_k) = \sum_{i=1}^{|D_X|} \sum_{j=1}^{|D_Y|} \mathrm{P}(z \in A_k \mid \{\theta^X_i, \theta^Y_j\} \in z) \, \mathrm{P}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in B_k) \tag{13}$$

where $A_k, B_k$ are non-empty sequences of observations [Hofmann:1998:SMC:888741]. The quantity $\mathrm{P}(z \in A_k \mid \{\theta^X_i, \theta^Y_j\} \in z)$ is the probability that an input-output pair $z$ belongs to subsequence $A_k$ given that the keywords $\{\theta^X_i, \theta^Y_j\}$ co-occur in $z$. Similarly $\mathrm{P}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in B_k)$ is the probability that keywords $\{\theta^X_i, \theta^Y_j\}$ co-occur in $z$ given that $z$ belongs to subsequence $B_k$.

Expression (13) can be applied directly to (11) so that

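$$\begin{aligned}
&\mathrm{P}(z \in Z^{u,c}_{u,k} \mid z \in Z_{u,k}) - \mathrm{P}(z \in Z^{u,c}_{u,k} \mid z \in Z_{p,k}) \\
&\quad = \sum_{i=1}^{|D_X|} \sum_{j=1}^{|D_Y|} \underbrace{\mathrm{P}(z \in Z^{u,c}_{u,k} \mid \{\theta^X_i, \theta^Y_j\} \in z)}_{(a)} \times \Big( \underbrace{\mathrm{P}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z_{u,k})}_{(b)} - \underbrace{\mathrm{P}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z_{p,k})}_{(c)} \Big)
\end{aligned} \tag{14}$$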

and the minimisation element of (11) becomes a calculation over the term labelled (c) in (14). We will return to the constraint element of (11) later.

Term (14)(a) is the only element of the RHS of (14) that depends on knowledge of the user labelling function $l_u$. Since (14)(a) and (14)(b) do not depend on the proxy agent $p$, they can be estimated privately by $u$. To allow (14) to be computed privately by a user, it is sufficient for each proxy agent to release the probability distribution (14)(c) publicly. With this a user can construct (14).

Expression (14) consists of matrix multiplications of matrices of size $|D_X| \times |D_Y|$. The proxy selection condition in (11) can be solved efficiently in practice by estimating the various probabilities.
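Once the three probability tables in (14) are available as arrays indexed by keyword pair, the right-hand sides of (13) and (14) reduce to elementwise products summed over all pairs. The sketch below illustrates this with plain nested lists. The function names and the 2-by-2 example values are hypothetical, and the tables are assumed to be already estimated.

```python
def mixture_probability(p_a_given_pair, p_pair_given_b):
    """Evaluate (13): sum over keyword pairs (i, j) of the product of the two tables."""
    return sum(
        a * b
        for row_a, row_b in zip(p_a_given_pair, p_pair_given_b)
        for a, b in zip(row_a, row_b)
    )

def topic_gap(p_c_given_pair, p_pair_given_user, p_pair_given_proxy):
    """Evaluate the RHS of (14): term (a) weighted by the difference (b) - (c)."""
    return sum(
        a * (b - c)
        for ra, rb, rc in zip(p_c_given_pair, p_pair_given_user, p_pair_given_proxy)
        for a, b, c in zip(ra, rb, rc)
    )

# Hypothetical tables for |D_X| = |D_Y| = 2.
term_a = [[0.9, 0.1], [0.5, 0.2]]          # held privately by the user
term_b = [[0.4, 0.1], [0.3, 0.2]]          # held privately by the user
term_c = [[0.25, 0.25], [0.25, 0.25]]      # published by the proxy agent

via_user = mixture_probability(term_a, term_b)
via_proxy = mixture_probability(term_a, term_c)
gap = topic_gap(term_a, term_b, term_c)
print(round(via_user, 3), round(via_proxy, 3), round(gap, 3))
```

Only `term_c` has to leave the proxy agent; the other two tables stay with the user, which is the property the text relies on for private proxy selection.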

IV-B Estimating Probabilities

To estimate probabilities in our prototype implementation, user applies their private labelling function to label each input–output pair for topics in . Let and denote the labelled inputs and outputs of respectively. Apply count-vectorisation to each element of and and gather the result into count-matrices and of size and respectively. Since , the quantity is of dimension . is the count co-occurrence matrix of input–output interactions of input–output features in labelled for topic . The –element of matrix , denoted , is the co-occurence count of the features in labelled for topic . We apply regular Laplace Smoothing, [Manning:2008:IIR:1394399], to avoid divide by zero underflows in subsequent computations when there are sparse occurrences of keywords in . Laplace smoothing resolves this problem by adding a factor to each keyword count so that . The quantity

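$$\hat{\mathrm{P}}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z^{u,c}_{u,k}) = \frac{N_{c,ij}}{N_c}, \qquad N_c = \sum_{i=1}^{|D_X|} \sum_{j=1}^{|D_Y|} N_{c,ij} \tag{15}$$

is then an estimator for $\mathrm{P}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z^{u,c}_{u,k})$. Similarly, an estimator for $\mathrm{P}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z_{u,k})$ is given by

$$\hat{\mathrm{P}}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z_{u,k}) = \frac{N_{ij}}{N}, \qquad N = \sum_{c \in C} N_c, \quad N_{ij} = \sum_{c \in C} N_{c,ij} \tag{16}$$

and

$$\hat{\mathrm{P}}(z \in Z^{u,c}_{u,k} \mid z \in Z_{u,k}) = \frac{N_c}{N} \tag{17}$$

is an estimator for the probability of an observation being labelled for topic $c$.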
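The estimators (15)-(17) can be illustrated with a short standard-library sketch. It is not the prototype code: the function names, the token-list representation of inputs and outputs, the example data and the smoothing constant of 1 added to every co-occurrence count are assumptions made for illustration.

```python
def cooccurrence_counts(interactions, dict_x, dict_y, topics, smoothing=1.0):
    """Return {topic: matrix} of smoothed input-output keyword co-occurrence counts."""
    counts = {c: [[smoothing] * len(dict_y) for _ in dict_x] for c in topics}
    for x_words, y_words, topic in interactions:
        for i, kx in enumerate(dict_x):
            nx = x_words.count(kx)          # count of input keyword i
            for j, ky in enumerate(dict_y):
                counts[topic][i][j] += nx * y_words.count(ky)
    return counts

def estimators(counts):
    """Compute the estimators (15), (16) and (17) from the smoothed counts."""
    n_c = {c: sum(map(sum, m)) for c, m in counts.items()}      # N_c
    n = sum(n_c.values())                                        # N
    first = next(iter(counts.values()))
    rows, cols = len(first), len(first[0])
    pair_given_topic = {
        c: [[v / n_c[c] for v in row] for row in m] for c, m in counts.items()
    }                                                            # (15)
    pair_given_user = [
        [sum(counts[c][i][j] for c in counts) / n for j in range(cols)]
        for i in range(rows)
    ]                                                            # (16)
    topic_prob = {c: n_c[c] / n for c in counts}                 # (17)
    return pair_given_topic, pair_given_user, topic_prob

# Hypothetical labelled history for one user.
dict_x, dict_y = ["cheap", "diet"], ["flights", "plan"]
history = [
    (["cheap"], ["flights", "flights"], "other"),
    (["diet"], ["plan"], "health"),
]
counts = cooccurrence_counts(history, dict_x, dict_y, ["other", "health"])
pair_given_topic, pair_given_user, topic_prob = estimators(counts)
print(pair_given_topic["other"][0][0], topic_prob["health"])
```

Smoothing keeps every entry strictly positive, so the later ratios are always defined even for keyword pairs that never co-occur in the history.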

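Let $O(z)$ have components $O_{ij}(z)$ given by

$$O_{ij}(z) = \begin{cases} 1 & \text{if } \phi^X_i(x) > 0 \text{ and } \phi^Y_j(y) > 0 \text{ for } z = \{x, y\} \\ 0 & \text{otherwise} \end{cases}$$

and define

$$O_{c,ij} := \sum_{z \in Z^{u,c}_{u,k}} O_{ij}(z), \qquad O_c := \sum_{i=1}^{|D_X|} \sum_{j=1}^{|D_Y|} O_{c,ij} \qquad \text{and} \qquad O := \sum_{c \in C} O_c$$

so that an estimator for $\mathrm{P}(z \in Z^{u,c}_{u,k} \mid \{\theta^X_i, \theta^Y_j\} \in z)$ is

$$\hat{\mathrm{P}}(z \in Z^{u,c}_{u,k} \mid \{\theta^X_i, \theta^Y_j\} \in z) = \frac{O_{c,ij}}{O_c} \tag{18}$$

and an estimator for $\mathrm{P}(z \in Z_{u,k} \mid \{\theta^X_i, \theta^Y_j\} \in z)$ is

$$\hat{\mathrm{P}}(z \in Z_{u,k} \mid \{\theta^X_i, \theta^Y_j\} \in z) = \frac{\sum_{c \in C} O_{c,ij}}{O} \tag{19}$$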
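The presence-based quantities behind (18) and (19) can be computed in the same style. The sketch below follows the formulas as written; the function names, the token-list data layout and the example history are hypothetical, and returning zero when a topic has no recorded co-occurrences is an assumption of this sketch rather than something specified in the text.

```python
def presence_counts(interactions, dict_x, dict_y, topics):
    """Return {topic: matrix} with O_{c,ij}: interactions containing both keywords."""
    o = {c: [[0] * len(dict_y) for _ in dict_x] for c in topics}
    for x_words, y_words, topic in interactions:
        xs, ys = set(x_words), set(y_words)
        for i, kx in enumerate(dict_x):
            for j, ky in enumerate(dict_y):
                if kx in xs and ky in ys:   # O_ij(z) = 1
                    o[topic][i][j] += 1
    return o

def presence_estimators(o):
    """Compute (18) per topic and (19) over all topics from the presence counts."""
    o_c = {c: sum(map(sum, m)) for c, m in o.items()}
    total = sum(o_c.values())
    first = next(iter(o.values()))
    rows, cols = len(first), len(first[0])
    est18 = {
        c: [[v / o_c[c] if o_c[c] else 0.0 for v in row] for row in m]
        for c, m in o.items()
    }
    est19 = [
        [sum(o[c][i][j] for c in o) / total if total else 0.0 for j in range(cols)]
        for i in range(rows)
    ]
    return est18, est19

# Hypothetical labelled history for one user.
dict_x, dict_y = ["cheap", "diet"], ["flights", "plan"]
history = [
    (["cheap"], ["flights", "flights"], "other"),
    (["diet"], ["plan"], "health"),
    (["cheap", "diet"], ["plan"], "health"),
]
o = presence_counts(history, dict_x, dict_y, ["other", "health"])
est18, est19 = presence_estimators(o)
print(est18["health"][1][1], est19[1][1])
```

Unlike the smoothed counts used for (15)-(17), these counts record only whether a keyword pair is present in an interaction, so repeated words within one interaction do not add weight.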

For a proxy agent , let and denote the inputs and outputs in respectively. Apply count-vectorisation to each element of and and gather the result into count-matrices and respectively of size and respectively. The quantity , of dimension , is the count co-occurrence matrix of input–output interactions of input–output features in , to which Laplace smoothing is applied. We estimate for each proxy agent as

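$$\hat{\mathrm{P}}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z_{p,k}) = \frac{M_{ij}}{M}, \qquad M = \sum_{i=1}^{|D_X|} \sum_{j=1}^{|D_Y|} M_{ij} \tag{20}$$

where $M_{ij}$ denotes the $(i,j)$-element of the proxy agent's smoothed co-occurrence count matrix.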

Expressions (16), (18) and (20) can then be combined, to estimate the RHS of (14) for each user .

In our experimental setup, it is convenient to estimate plausible deniability directly from the definition (1) as

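$$\Delta^{u,c}_{att,k} := \hat{\mathrm{P}}(z \in Z^{u,c}_{k} \mid z \in Z_{att,k}) = \frac{\left| \{ z \in Z_{att,k} : l_u(z) = c \} \right|}{\left| Z_{att,k} \right|} \tag{21}$$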
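The estimator (21) is a simple labelled fraction, which makes it convenient for comparing observers with different views of the data. The following sketch is illustrative only: the function names, the keyword-based labelling function and the example observations are hypothetical. It evaluates (21) once for all observations and once for the observations routed through a single proxy.

```python
def deniability_estimate(observed, label_fn, topic):
    """Estimate (21): fraction of observed interactions labelled with the topic."""
    if not observed:
        return 0.0
    hits = sum(1 for z in observed if label_fn(z) == topic)
    return hits / len(observed)

def plausibly_deniable(observed, label_fn, sensitive_topics, delta):
    """Check condition (1) for every sensitive topic using the estimate (21)."""
    return all(
        deniability_estimate(observed, label_fn, c) <= delta
        for c in sensitive_topics
    )

def label(z):
    """Hypothetical private labelling function: 'diet' queries are sensitive."""
    proxy, query = z
    return "health" if "diet" in query else "other"

# Hypothetical observations as (proxy, query) pairs.
all_obs = [
    ("p1", "diet plan"), ("p1", "diet tips"),
    ("p1", "cheap flights"), ("p1", "hotel deals"),
] + [("p2", "cheap flights")] * 6
proxy_obs = [z for z in all_obs if z[0] == "p1"]

print(deniability_estimate(all_obs, label, "health"))
print(deniability_estimate(proxy_obs, label, "health"))
print(plausibly_deniable(all_obs, label, ["health"], delta=0.25))
print(plausibly_deniable(proxy_obs, label, ["health"], delta=0.25))
```

With these made-up numbers the condition holds against the wider view but fails against the view restricted to one proxy, which matches the observation that more localised observers are more powerful.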

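The probability of user $u$ observing an input–output pair labelled with topic $c$ when accessing the system through proxy agent $p$ is $\mathrm{P}(z \in Z^{u,c}_{p,k} \mid z \in Z_{p,k})$. This is estimated in our experimental setup as

$$\hat{\mathrm{P}}(z \in Z^{u,c}_{p,k} \mid z \in Z_{p,k}) = \frac{\left| \{ z \in Z_{p,k} : l_u(z) = c \} \right|}{\left| Z_{p,k} \right|} \tag{22}$$

and $\mathrm{P}(z \in Z^{u,c}_{u,k} \mid z \in Z_{u,k})$, the probability of user $u$ observing an input–output pair labelled with topic $c$ when accessing the system directly, is estimated as

$$\hat{\mathrm{P}}(z \in Z^{u,c}_{u,k} \mid z \in Z_{u,k}) = \frac{\left| \{ z \in Z_{u,k} : l_u(z) = c \} \right|}{\left| Z_{u,k} \right|} \tag{23}$$

We measure the estimated utility loss incurred by user $u$ as a result of selecting proxy agent $p$, using (22) and (23), as

$$\Delta U^{u,c}_{p,k} := \frac{1}{2} \sum_{c \in C} \left| \hat{\mathrm{P}}(z \in Z^{u,c}_{u,k} \mid z \in Z_{u,k}) - \hat{\mathrm{P}}(z \in Z^{u,c}_{p,k} \mid z \in Z_{p,k}) \right| \tag{24}$$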

that is, the total variation between the sensitive topic probability estimator the user would calculate if they used directly and the probability estimator of the topic calculated by the proxy agent they used.
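As a small worked illustration of (24), the sketch below computes the total variation distance between two topic distributions. The function name and the example distributions are hypothetical; the inputs are assumed to be the estimates (23) and (22) stored as dictionaries keyed by topic.

```python
def utility_loss(direct, via_proxy):
    """Compute (24): half the L1 distance between the two topic distributions."""
    topics = set(direct) | set(via_proxy)
    return 0.5 * sum(
        abs(direct.get(c, 0.0) - via_proxy.get(c, 0.0)) for c in topics
    )

# Hypothetical topic distributions for one user and one proxy agent.
direct = {"other": 0.7, "health": 0.3}
via_proxy = {"other": 0.9, "health": 0.1}
loss = utility_loss(direct, via_proxy)
print(round(loss, 3))
```

The result lies between 0, when the proxy reproduces the user's topic distribution exactly, and 1, when the two distributions have no topics in common.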

IV-C User Estimate of Privacy Threat

The challenge for a user in checking (1) is that it requires knowledge of $Z^{u,c}_{k}$ by user $u$, so that $u$ is required to know the history of input–output interactions for each sensitive topic for all users in the 3PS system.

In the prototype implementation we use the approach that each user $u$ has defined a set, $\Theta^{u,c}_{u,k}$, for each sensitive topic $c$, consisting of input–output keyword pairs whose presence means an input–output observation is labelled as sensitive by $u$. In experiments, $\Theta^{u,c}_{u,k}$ is selected for each user and topic using the training data, by choosing the keyword pairs whose estimated topic probability exceeds a threshold:

$$\Theta^{u,c}_{u,k}(\alpha) = \left\{ \{\theta^X_i, \theta^Y_j\} : \hat{\mathrm{P}}(z \in Z^{u,c}_{u,k} \mid \{\theta^X_i, \theta^Y_j\} \in z) > \alpha \right\} \tag{25}$$

where $\alpha$ is a parameter chosen using cross-validation.

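For each topic $c$ define the associated indicator function over observations $z$ and keyword pairs $\{\theta^X_i, \theta^Y_j\}$ as

$$\iota^{c}_{\alpha}(\{\theta^X_i, \theta^Y_j\} \mid z) = \begin{cases} 1 & \text{if } \{\theta^X_i, \theta^Y_j\} \in z \text{ and } \{\theta^X_i, \theta^Y_j\} \in \Theta^{u,c}_{u,k}(\alpha) \\ 0 & \text{otherwise} \end{cases} \tag{26}$$

That is, the indicator function labels an observation as sensitive if it contains an input–output keyword pair from $\Theta^{u,c}_{u,k}(\alpha)$ and non-sensitive otherwise. Using the bag-of-words model to combine this with the published estimator provided by each proxy agent we get an estimator for $\mathrm{P}(z \in Z^{u,c}_{k} \mid z \in Z_{p,k})$ given by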

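$$\hat{\mathrm{P}}_{\alpha}(z \in Z^{u,c}_{k} \mid z \in Z_{p,k}) = \frac{\sum_{i=1}^{|D_X|} \sum_{j=1}^{|D_Y|} \iota^{c}_{\alpha}(\{\theta^X_i, \theta^Y_j\} \mid z) \, \hat{\mathrm{P}}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z_{p,k})}{\sum_{c \in C} \sum_{i=1}^{|D_X|} \sum_{j=1}^{|D_Y|} \iota^{c}_{\alpha}(\{\theta^X_i, \theta^Y_j\} \mid z) \, \hat{\mathrm{P}}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z_{p,k})} \tag{27}$$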
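The thresholding step (25) and the estimator (27) can be sketched as follows. This is an interpretation under stated assumptions rather than the prototype code: the function names and example tables are hypothetical, keyword pairs are represented by their index pairs $(i, j)$, and the indicator in (27) is treated simply as membership of a keyword pair in the topic's thresholded set.

```python
def sensitive_pairs(p_topic_given_pair, alpha):
    """Sketch of (25): index pairs (i, j) whose estimated probability exceeds alpha."""
    return {
        (i, j)
        for i, row in enumerate(p_topic_given_pair)
        for j, value in enumerate(row)
        if value > alpha
    }

def threat_estimate(pair_sets, p_pair_given_proxy, topic):
    """Sketch of (27): proxy mass on the topic's pairs, normalised over all topics."""
    def mass(pairs):
        return sum(p_pair_given_proxy[i][j] for i, j in pairs)

    total = sum(mass(pairs) for pairs in pair_sets.values())
    return mass(pair_sets[topic]) / total if total else 0.0

# Hypothetical per-topic estimates of the kind produced by (18).
p_given_pair = {
    "health": [[0.1, 0.8], [0.2, 0.9]],
    "other": [[0.9, 0.2], [0.8, 0.1]],
}
alpha = 0.5  # hypothetical threshold; the text selects it by cross-validation
pair_sets = {c: sensitive_pairs(m, alpha) for c, m in p_given_pair.items()}

# Hypothetical distribution published by a proxy agent, as in (20).
published = [[0.4, 0.1], [0.3, 0.2]]
estimate = threat_estimate(pair_sets, published, "health")
print(sorted(pair_sets["health"]), round(estimate, 3))
```

The user needs only their own keyword sets and the distribution published by the proxy agent, so the check can run locally and be compared against the chosen $\delta$.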

In a real-world setup it is up to the user to decide how to select $\Theta^{u,c}_{u,k}$. For example, the PRI tool developed in [mac.P1] and [mac.P2] allows a user to analyse input–output observations for privacy threats and so assess which keyword pairs are more or less revealing of sensitive topics. In this way tools such as PRI can provide information to assist in constructing $\Theta^{u,c}_{u,k}$ in a real-world setup.

V Experimental Setup

V-A General Setup

In our experimental setup, the test datasets, described later, are labelled with a set of topics . Before an experimental run each user and proxy agent simulated during the experiment is allocated a topic of interest from . When a user or proxy agent is allocated the non-sensitive, catch-all topic we will say the user or proxy agent is randomly initialised meaning that they have no interest in a specific sensitive topic. We call the percentage of proxy agents in or users in that have been randomly initialised the diversity of or . During experiments we will typically report results for , and diversity in and/or .

At the start of each experimental run, each user and each proxy agent is allocated initial data consisting of input–output pairs from the test dataset labelled for their allocated topic of likely interest, referred to as background knowledge. Each user and proxy agent in the simulation has a copy of the common dictionaries and from . Next, each user and each proxy agent estimates initial values of the probabilities in Section IV-B from the initial background knowledge using and . We refer to these probabilities as the internal state of the user or proxy agent. An input query is a keyword in drawn from at random by .

Users select a proxy agent best matching their allocated topic of interest by solving (11). When a proxy agent receives an input query from a user it passes it directly to . Since the set of topics is known to in our experiments, creates a personalised response by solving , to find the topic of maximum likely interest from given the input it received, and then selecting an output labelled for . The resulting output is returned to the proxy agent. The input–output interaction pair is added to the background knowledge of the proxy agent and its internal state is updated with new probability estimates. The output is routed to the requesting user and the same input–output interaction is added to its background knowledge and its internal state and probability estimator are updated.

Background knowledge is not shared among users and proxy agents. When a user switches to a different proxy agent during an experimental run, the user history of input–output interactions does not transfer to the new proxy agent so that individual proxy agents see only the history of interactions from users accessing through it. A full reset is performed between test runs by re-initialising the entire setup.

V-B Data Sources

Data from three real-world sources are used in experiments.

Hotels

Tripadvisor hotel reviews containing hotel review titles, review bodies and lowest price per room, downloaded from [wang.data] and consisting of over million hotel reviews. Queries consisting of words extracted from review titles are used as inputs and detailed review bodies represent outputs.

Products

Product review titles, review bodies and overall rating scores downloaded from [wang.data], containing Amazon product reviews for types of merchandise and consisting of over million product reviews. Words appearing in product review titles are used as query inputs and review bodies as outputs.

Search

Web search queries and corresponding result pages relevant to the sensitive topics “weight loss”, “anorexia”, “diabetes”, “bad credit history” and “pregnancy” used in [mac.P1] and [mac.P2], and comprising Google searches constructed by gathering search terms from the Wikipedia article related to each sensitive topic and from the top web search queries appearing on www.Soovle.com for the non-sensitive queries. Here the queries submitted to Google are the inputs with the corresponding result pages taken as outputs.

V-C Assigning Topics

Default topics for experiments were defined as follows from each of the test datasets.

Hotels

Five topic categories are defined by dividing the lowest price per room into equally spaced ranges, namely . Reviews are then labelled according to the lowest price.

Products

The overall rating score is used to define topic categories, namely very dissatisfied (Topic 1) to very satisfied (Topic 4). Topic is used to indicate no rating was given so there are topic categories in total.

Search

There are topic categories labelled “Other”, “weight loss”, “anorexia”, “diabetes”, “bad credit history”, “pregnancy” as in [mac.P1, mac.P2]. Each input–output pair is labelled with the topic the input query refers to.

When experiments require a larger number of topics than those above, the Hotels dataset is divided into the required number of topic categories by specifying different lowest price ranges. In this way a variety of topic categories can be created automatically by re-grouping the data into finer price categories. The Hotels dataset was chosen for convenience since the categories are defined by numeric, price-per-room ranges, and so it is straightforward to programmatically define more categories by changing the numeric ranges.
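Re-grouping by price range can be sketched as below. The function name `price_topic`, the price bounds of 0 and 500, and the sample prices are hypothetical values chosen for illustration; the actual ranges used in the experiments are not reproduced here.

```python
def price_topic(price, low, high, n_topics):
    """Map a lowest price per room to a topic index in 1..n_topics using equal-width ranges."""
    if n_topics < 1 or high <= low:
        raise ValueError("need n_topics >= 1 and high > low")
    width = (high - low) / n_topics
    index = int((price - low) // width) + 1
    # Clamp prices that fall on the upper boundary or outside the assumed range.
    return min(max(index, 1), n_topics)


prices = [40.0, 120.0, 260.0, 500.0]  # hypothetical lowest prices per room
five_topics = [price_topic(p, 0.0, 500.0, 5) for p in prices]
ten_topics = [price_topic(p, 0.0, 500.0, 10) for p in prices]
print(five_topics)  # [1, 2, 3, 5]
print(ten_topics)   # [1, 3, 6, 10]
```

Changing only `n_topics` produces a finer labelling of the same reviews, which is the property that makes this dataset convenient when more topic categories are needed.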

V-D Revealing Keyword Pairs

Each of the test datasets was preprocessed using the text processing described in Section IV-B to produce dictionaries and for each dataset. A range of dictionary sizes from to features was assessed by selecting random subsequences and choosing the dictionaries that minimise

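 \[ \left| \hat{\rm P}(z \in A_k \mid z \in Z_k) - \sum_{i=1}^{|D_X|} \sum_{j=1}^{|D_Y|} \hat{\rm P}(z \in A^{u,c}_k \mid \{\theta^X_i, \theta^Y_j\} \in z)\, \hat{\rm P}(\{\theta^X_i, \theta^Y_j\} \in z \mid z \in Z_k) \right| \tag{28} \]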

From this we selected and for our experiments.
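As a hedged illustration of the selection criterion in (28), the sketch below evaluates the absolute gap between a directly estimated probability and its reconstruction from keyword-pair probabilities, and picks the candidate dictionaries with the smallest gap. The names `decomposition_gap` and `choose_dictionaries`, the nested-list layout of the probability tables and the numerical values are assumptions for illustration; how the estimates themselves are obtained follows Section IV-B and is not shown.

```python
def decomposition_gap(p_direct, p_sensitive_given_pair, p_pair_given_topic):
    """Absolute difference between a direct estimate and its keyword-pair reconstruction.

    Both tables are indexed [i][j] over input keyword i and output keyword j.
    """
    reconstructed = sum(
        p_sensitive_given_pair[i][j] * p_pair_given_topic[i][j]
        for i in range(len(p_pair_given_topic))
        for j in range(len(p_pair_given_topic[i]))
    )
    return abs(p_direct - reconstructed)


def choose_dictionaries(candidates):
    """candidates: list of (name, p_direct, p_sensitive_given_pair, p_pair_given_topic)."""
    return min(candidates, key=lambda c: decomposition_gap(c[1], c[2], c[3]))[0]


# Two hypothetical candidate dictionary pairs with made-up probability tables.
uniform = [[0.25, 0.25], [0.25, 0.25]]
candidates = [
    ("candidate_a", 0.5, [[1.0, 0.0], [0.5, 0.5]], uniform),
    ("candidate_b", 0.5, [[1.0, 1.0], [1.0, 0.0]], uniform),
]
best = choose_dictionaries(candidates)
print(best)
```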

The distribution of keyword pairs in samples drawn from each of the three test datasets is shown in Figure 3 by topic. Average values were calculated by taking samples each of items from each of the test datasets. Error bars in Figure 3 indicate variance from sampling. In the case of all datasets and for all topics, the co-occurrence frequency of the majority of keyword pairs falls below . The rarest keyword pairs by topic, and hence the most revealing, have co-occurrence frequencies greater than . These keyword pairs comprise less than of the total keyword pairs, suggesting that the most revealing keyword pairs form a small subset in the case of all datasets.
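One simple way to tabulate keyword-pair co-occurrence by topic is sketched below. Here the frequency of a pair is taken to be the fraction of items labelled with a topic in which both the input keyword and the output keyword appear; this definition, the function name `pair_frequencies` and the toy items are assumptions for illustration and may differ from the calculation behind Figure 3.

```python
from collections import Counter
from itertools import product


def pair_frequencies(items, dict_x, dict_y):
    """Relative co-occurrence frequency of (input keyword, output keyword) pairs per topic.

    items: iterable of (input_words, output_words, topic), with words given as sets.
    """
    pair_counts = {}
    item_counts = Counter()
    for input_words, output_words, topic in items:
        item_counts[topic] += 1
        xs = dict_x & input_words   # dictionary keywords present in the input
        ys = dict_y & output_words  # dictionary keywords present in the output
        pair_counts.setdefault(topic, Counter()).update(product(sorted(xs), sorted(ys)))
    return {
        topic: {pair: n / item_counts[topic] for pair, n in counts.items()}
        for topic, counts in pair_counts.items()
    }


items = [
    ({"insulin", "diet"}, {"glucose"}, "diabetes"),
    ({"diet"}, {"glucose", "calories"}, "diabetes"),
]
freqs = pair_frequencies(items, {"insulin", "diet"}, {"glucose", "calories"})
print(freqs["diabetes"])
```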

VI Experimental Evaluation

VI-A Topic Diversity and User Numbers

We assess the effects of topic diversity and user numbers for the case consisting of a single proxy agent and a single sensitive topic. We denote the sensitive topic so that where is the catch-all topic. A single proxy agent setup means so that results here apply to both proxy and global observers.

Tests were repeated with and of users having and the remainder having . We report results for , and users for compactness. Results are averaged by dataset and error about the mean is shown as a shaded region. Plausible deniability, from (21), and utility loss, from (24), averaged over users, are shown in Figure 4. Plausible deniability is plotted in the first row and utility loss in the second row.
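The averaging and the shaded error region can be illustrated with a short sketch. It assumes the shaded region spans one standard error either side of the mean, which is an assumption about the plotting convention; the function name `mean_and_stderr` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_stderr(values):
    """Return the sample mean and the standard error of the mean."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return m, se


# Hypothetical per-run measurements of utility loss for one configuration.
runs = [0.2, 0.4, 0.6]
m, se = mean_and_stderr(runs)
band = (m - se, m + se)  # lower and upper edge of the shaded region
print(m, se, band)
```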

From (1), a user has better plausible deniability for lower values of since is an upper bound. Our results suggest that increasing user numbers decreases and so improves plausible deniability but only when users have varied interests. Once users have a diverse range of interests, increasing the number of users is observed to accelerate improvement in plausible deniability. For utility loss, increasing volumes of users without specific interests is observed to increase utility loss. When all users of a proxy agent have no specific topic interests so that diversity is high this is reflected in increased utility loss relative to topic as one might expect.

VI-B Personalisation Performance

In 3PS users select proxies closest to their interests but the responses generated by proxy agents also change as users submit queries via them. We would like this joint selection/update process to converge so as to achieve good personalisation performance. In this section we use our prototype implementation to evaluate this process. Experimental setups with proxy pools of sizes and numbers of users were configured for each of the test datasets. We initialise proxy agents in randomly so that there is no automatic choice of best proxy agent–user match. Users are allocated a sensitive topic as their target topic from the set of topics in each of the test datasets. Each user applies (11) to select a proxy agent best matching their target topic by enumerating each proxy agent in in turn. Users only submit queries related to their allocated topic of interest, so that noise due to diverse topic interests of users is controlled in the setup here in order to focus on convergence properties. Once a proxy agent is selected a user issues a query related to their topic of interest and the internal states of users and proxy agents are updated accordingly. Results are reported as averages over and and topic for compactness and shown in Figure 5.

The measured accuracy of (11) for proxy agent selection is shown in the LHS plot of Figure 5. Proxy agent selection is deemed to be accurate when a user chooses a proxy agent whose allocated topic of most likely interest matches the allocated target topic of the user. The RHS of Figure 5 is the utility loss, calculated from (24), taken at each input–output step. For visual clarity, standard error is shown for the average utility loss over all datasets.
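The accuracy criterion described above can be written down as a small sketch. The function name `selection_accuracy` and the dictionary-based representation are assumptions; the selection rule (11) itself is not reproduced, and users who select no proxy agent are simply counted as inaccurate here.

```python
def selection_accuracy(user_targets, chosen_proxy, proxy_topics):
    """Fraction of users whose chosen proxy agent has the same allocated topic as the user.

    user_targets: user id -> target topic
    chosen_proxy: user id -> proxy id (or None if no proxy agent was selected)
    proxy_topics: proxy id -> allocated topic of most likely interest
    """
    if not user_targets:
        raise ValueError("need at least one user")
    correct = 0
    for user, target in user_targets.items():
        proxy = chosen_proxy.get(user)
        if proxy is not None and proxy_topics.get(proxy) == target:
            correct += 1
    return correct / len(user_targets)


# Hypothetical step in which every user has attached to the same proxy agent.
targets = {"u1": "diabetes", "u2": "pregnancy", "u3": "diabetes"}
chosen = {"u1": "p1", "u2": "p1", "u3": "p1"}
topics = {"p1": "diabetes", "p2": "pregnancy"}
accuracy = selection_accuracy(targets, chosen, topics)
print(accuracy)
```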

Utility loss is high and accuracy is low initially reflecting the fact that the initial internal state of proxy agents is randomly set. Convergence to the proxy agent with closest interests is observed to happen quickly for all data sources, achieving at least accuracy for all datasets after iterations with a corresponding average utility loss of . When averaged over all data sources the average accuracy is after input–output steps. The utility loss is also observed to decrease for all topics over time, reaching an average across all datasets of after input–output iterations and by iteration .

Users are observed to select the correct proxy agent with greater than accuracy, and to reject all proxy agents with accuracy if there is no suitable proxy agent available. Overall, in experiments where the ratio of users to proxy agents was increased from to , the utility loss is observed to decrease more slowly as the average number of users attaching to each proxy agent increases. When the ratio of users to proxy agents was , for example, the average utility loss on step 1 was . Convergence to a low utility loss was also observed to be rapid, even at high user to proxy agent ratios, reaching after 4 input–output steps when the user to proxy agent load factor was .

The number of topic categories was also varied by regrouping the Hotels dataset. High proxy agent selection accuracy was consistently observed, with accuracy of greater than after step 3. The utility loss was also observed to decrease rapidly to less than after 4 input–output steps, reaching a minimum of less than by iteration on average over all topics.

Overall, the results suggest that the proxy agent selection method converges rapidly and accurately, providing a high degree of personalisation. Utility loss also decreases rapidly as more topic specific input–output events are observed. This is consistent across the test datasets, and for a range of user–to–proxy agent ratios, suggesting that the proxy agent selection mechanism performs well across a variety of setups.

VI-C Plausible Deniability

We next assess the degree of plausible deniability protection available to users with respect to a proxy observer when there are multiple proxy agents. We also assess how diversity in user topic interests influences plausible deniability and utility loss. Since a proxy observer is at least as powerful as a global observer the results here provide worst-case bounds in the face of a global observer. Experimental setups with proxy pools of sizes and numbers of users were configured for each of the test datasets. Each proxy agent was allocated a topic as their topic of interest. Each user was allocated a target topic of interest with setups of and of users having to model various levels of diversity of topic interests among users. Results are reported as averages over and and topic for compactness and shown in Figure 6 and Figure 7.

In Figure 6 we show measurements of estimated level of plausible deniability. We show estimates of calculated directly from (21), together with the values of the estimator (27) calculated using as the set of sensitive keywords. To model the situation where the user has partial or censored dictionaries and in experiments, we show measurements for values for .

The results shown in Figure 6 indicate that plausible deniability improves monotonically as diversity of user interest in topics increases. This is true when either (21) or (27) is used as the estimator, for all values of . The estimated value using (27) is consistently lower than the corresponding estimate from (21) for all values of tested.

Figure 7 illustrates the trade-off between improved privacy and utility loss. Increasing utility loss is observed in all cases as the fraction of users with diverse topic interests increases, since the “signal-to-noise” ratio of coherent interests to random interests decreases. This is observed when either (21) or (27), for all values of , is used as the estimator. Using (27) is observed to under-estimate utility loss over all datasets tested. In this case (27) should be taken as a best-case guarantee of utility loss, with the actual utility loss expected to be higher. We note that the ultimate assessment of utility loss is up to the user: if they do not like the personalised content they receive then they can switch to another proxy agent, or stop using the system entirely.

VI-D Defending Privacy

We consider a proactive privacy defence strategy of injecting random queries. Between “true” queries a user issues “noise” queries to every member of the proxy agent pool other than their selected best matching proxy agent about topics other than their allocated topic of interest. This defence is motivated by the observation earlier that increased diversity of topic interests among users is reported to increase plausible deniability. By controlling the level of noise injection we hope to limit the associated utility loss. In practice this kind of injection of obfuscating, uninteresting, “noise” queries can be performed in the background by users.

Experimental setups with proxy pools of sizes and numbers of users were configured for each of the test datasets. Each proxy agent was allocated a topic as their topic of interest. Each user was allocated a target topic of interest with setups of and of users having to model various levels of diversity of topic interests among users. After a sensitive, true input for topic was issued to a chosen proxy agent, a noise query was constructed where input keywords were drawn at random for topics other than the sensitive user topic , and issued to all proxy agents in the pool, except the last chosen proxy agent. To assess the effect of issuing different amounts of noise queries mixed with true queries, “Topic–to–Noise” ratios of and were also used. For example, in the case of a Topic–to–Noise ratio of , noise queries are issued for every true queries on average by a user. Results are reported as averages over and and topic for compactness and shown for measurements of plausible deniability in Figure 8, and for utility loss in Figure 9. The first plot in each case shows the case when there is diversity of topic interest in the proxy agent pool as a baseline.
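The noise injection procedure described above might be sketched as follows. The function names, the `noise_per_true` parameter (the average number of noise queries per true query) and the toy keyword sets are assumptions for illustration; in particular, a noise query broadcast to all other proxy agents is counted once, which is one possible reading of the ratio.

```python
import random


def noise_targets(proxy_pool, chosen_proxy):
    """Proxy agents that receive a noise query: all except the last chosen proxy agent."""
    return [p for p in proxy_pool if p != chosen_proxy]


def make_noise_query(keywords_by_topic, sensitive_topic, rng):
    """Draw one keyword at random from a topic other than the user's sensitive topic."""
    other_topics = sorted(t for t in keywords_by_topic if t != sensitive_topic)
    topic = rng.choice(other_topics)
    return topic, rng.choice(sorted(keywords_by_topic[topic]))


def issue_round(true_query, chosen_proxy, proxy_pool, keywords_by_topic,
                sensitive_topic, noise_per_true, rng):
    """Return the (proxy, query, is_noise) messages sent for one true query."""
    messages = [(chosen_proxy, true_query, False)]
    # The integer part is always sent; the fractional part is sent with matching
    # probability, so the requested ratio holds on average.
    n_noise = int(noise_per_true) + (rng.random() < noise_per_true % 1)
    for _ in range(n_noise):
        _, keyword = make_noise_query(keywords_by_topic, sensitive_topic, rng)
        messages.extend((p, keyword, True) for p in noise_targets(proxy_pool, chosen_proxy))
    return messages


rng = random.Random(1)
keywords = {"diabetes": {"insulin"}, "other": {"weather", "football"}, "pregnancy": {"scan"}}
pool = ["p1", "p2", "p3"]
messages = issue_round("insulin", "p1", pool, keywords, "diabetes", 1, rng)
for message in messages:
    print(message)
```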

With the random noise injection strategy, plausible deniability against a proxy observer improves steadily during an experimental run for all levels of topic diversity in our experiments. For all levels of topic diversity, adding more noise results in faster improvement in plausible deniability, as expected intuitively. As the topic diversity in the proxy agent pool increases, less random noise is required to produce the same change in plausible deniability that larger random noise levels produce at lower diversity. Intuitively this is to be expected since topic diversity is an indication of the variation in topic interests among users. Standard error in the mean, shown as shaded regions, is small, indicating that improved plausible deniability is observed with high confidence for all datasets.

Utility loss, shown in Figure 9, increases initially and achieves stable levels after input–output steps, with the cases where topic diversity is highest reaching a stable level most quickly. Standard error is small in the case of all datasets, suggesting the average values plotted reflect expected behaviour with high confidence.

The plausible deniability and utility loss results for topic diversity are a worst case. Even in this case, at levels of random noise up to , the utility loss is after 20 steps, compared with an improvement in plausible deniability from to on average. As topic diversity increases the improvements in plausible deniability are larger than the associated utility losses in all cases. Taken overall, our results suggest that the benefits to privacy of adopting a strategy of random noise injection outweigh the associated utility losses, with the greatest benefits occurring when the privacy risk from low topic diversity is highest. Run as a background task, injecting random noise by all users in a controlled manner provides a mechanism for enforcing effective topic diversity in the proxy agent pool with corresponding benefits for privacy.

VI-E Discussion

The results of the random proxy injection defence in our experiments suggest that once a user is alert to diversity, the 3PS setup can provide balance of probability plausible deniability of topic interests. The method of choosing revealing keyword pairs outlined in Section IV-C provides a practical bound on plausible deniability and is straightforward to apply in practice. In a production setting a browser plug-in could automatically suggest new keywords for inclusion by the user in local keyword dictionary extensions.

To apply (1) in practice, a user also needs a way of confirming that proxy agents are being truthful about the probability estimators they publish. The notion of probe queries was introduced in [mac.P1] to allow a user to test the behaviour of black-box systems without revealing sensitive interests. By checking input–output interactions users can label the observation as sensitive or not and adjust their view of revealing keywords. The techniques introduced in [mac.P1] can be used to check for observations that vary from those expected from (27), indicating possible concerns with the estimators distributed by that proxy agent.

Choosing to estimate plausible deniability requires care. From (27) it follows that

 \[ \hat{\rm P}_{\alpha}(z \in Z^{u,c}_k \mid z \in Z_{att,k}) < \hat{\rm P}_{\beta}(z \in Z^{u,c}_k \mid z \in Z_{att,k}) \]

when . Choosing to include as many keywords as possible in is the safest threat detection strategy in our setup here. We have assumed here that there is no incentive for dishonesty and that there is neither malicious poisoning nor accidental corruption in our setup. In a real-life, production setup when or are partially complete, poisoned or deliberately censored, a user may choose any input–output keywords for . We note that the techniques introduced in [mac.P1, mac.P2] provide tools to test when input–output keywords indicate privacy concerns that could be adapted to assist a user with constructing .

While our experiments suggest that 3PS can provide acceptable levels of plausible deniability with low utility loss, our results also emphasise the importance of maintaining adequate vigilance to prevent interests in sensitive topics from leaking and taking care to avoid overly revealing content that might compromise plausible deniability when user interests are known.

VII Related Work

The potential for privacy concerns in recommender systems is well known in the literature. For example, Shilling attacks are discussed in [Lam:2004:SRS:988672.988726]; Sybil attacks to determine user preferences in [Calandrino:2011:YML:2006077.2006768]; Shilling attacks to sabotage recommendations in [Lam:2006:YTY:2094770.2094773]; and the use of auxiliary information to de-anonymise Netflix data in [DBLP:journals/corr/abs-cs-0610105] and references therein.

Privacy preserving techniques in recommendation systems have largely focused on how to incorporate privacy into the recommendation process. In [Batmaz:2016:RPF:3010164.3010593], random perturbation of data is used to develop privacy-preserving frameworks for collaborative filtering methods. In [Boutet:2016:PDC:2975427.2975447], profile obfuscation together with a randomised dissemination protocol are employed. Another approach is to distribute the recommendation process by including a trusted intermediate agent between user and backend system, such as [aimeur2008lambic]. In [McSherry:2009:DPR:1557019.1557090], differential privacy is incorporated into the algorithms used in the Netflix prize competition to produce privacy preserving recommendations.

Grouping users behind intermediate layers is a well-studied privacy technique. Protecting the sensitivity of user data, and particularly of user profiles exposed to the online system, by grouping users behind a proxy layer is defined as Level 2 Privacy in the classification scheme of online privacy approaches in [shen2007privacy]. In [petit2014towards] a third-party privacy-proxy hosts a group profile, where the privacy-proxy performs aggregation over multiple user activities to produce the group profile. In [shou2014supporting] profile generalisation is achieved locally on a user’s machine via a user-side privacy-proxy layer where the group profile is learned from a globally accessible taxonomy of topics. Obfuscating user data through profile generalisation is studied extensively in the literature. Approaches typically obfuscate or mask user interactions with search engines with the aim of disrupting online profiling and personalisation. PEAS, [Petit:2014:TEA, PEAS], combines local obfuscation with a privacy-proxy to provide unlink-ability between user and query. Shuffling user profiles is proposed as a counter to unwanted profiling in [Biega:2017:PTS:3077136.3080830]. The approach introduces a trusted third party server to shuffle individual profiles among a pool of users without regard for protecting utility. A common consideration for generalisation approaches in these works is how to provide minimally sufficient common structure to effectively generalise user interests without incurring unacceptable loss of utility. This is commonly solved by distributing generalised usage data, allowing users to create statistical patterns of usage. For example, in [shou2014supporting], a global dictionary is used that includes statistics on frequency of occurrence of concepts in it. In [petit2014towards], an intermediate privacy-proxy server distributes similar usage statistics allowing users to construct statistical models of usage.

There are examples of website proxies offering privacy preserving services to access mainstream search engines on the Internet. Two of the better known are DuckDuckGo hosted in the US on Amazon Web Services, [inc.duckduckgo_duckduckgo_2018], and StartPage hosted privately in the Netherlands, [holding_bv_surfboard_startpage_2018]. Functionally both are similar, encrypting traffic via HTTPS, and employing POST and re-direct techniques to obfuscate requests. Both claim to relieve so-called filter-bubbles, [Pariser:2011:FBI:2029079], by aggregating results from several source systems. In both DuckDuckGo and StartPage the proxy user profile adopted by users is global. Personalised content such as advertising that is displayed on search result pages is correspondingly generic.

There is evidence that users are concerned about their privacy on the Web but do not always reflect this concern in their online behaviours, [acquisti2015privacy]. In [DBLP:conf/imc/PujolHF15], in-the-wild measurements of user interactions with Ad blocking technologies suggest that users overwhelmingly accept default settings and do not install updates such as whitelists. The conclusion reached is that technologies for user privacy must be effective, but also unobtrusive and simple to maintain. By comparison with users, online systems have proven alert and adaptable in responding to attempts to protect privacy at individual user level. Stateful (cookie) and stateless (fingerprinting) tracking is widespread on the web. In [Bielova:2017:WTT:3133956.3136067, binns2018measuring, englehardt2016online, narayanan2017princeton] separate studies of 1 million websites reveal widespread data exchange among third parties, stateful tracking from third-party cookie spawning and stateless fingerprint-based tracking. In [binns2018measuring] users are observed to be tracked by multiple entities in tandem on the web.

Search engine algorithm evolution is a continuous “arms-race”, as evidenced in the case of Google, for example, by major algorithm changes such as Caffeine and Search+ Your World, which included additional sources of background knowledge from social media, and Panda, which improved filtering of content to counter spam and content manipulation. More recently semantic search capability has been added through Knowledge Graph and HummingBird, [search_timeline_1], [search_history], [search_history_2]. The importance of personalising content in the online arms-race is further underlined by the continuing arms-race between Ad-blockers and web-site owners. Anti-Ad-blockers are discussed in [nithyanand2016adblocking, mughees2017detecting], in a study of 100,000 popular websites finding evidence that web-site owners are making visible changes to content when Ad-blockers are detected. In a small number of cases pop-ups are presented that cannot be dismissed until the Ad-blocking software is disabled.

VIII Discussion and Conclusions

VIII-A Discussion

Accessing online systems via shared proxies in 3PS appears to provide a natural form of “hiding in the crowd” privacy once there is sufficient diversity of users and input–output interactions. Hence, when the 3PS architecture is used, the main requirement to obtain privacy is to ensure sufficient diversity. Quantities here are expressed in terms of probabilities, with randomness in the process of observing input–output interactions arising from randomness in how the user chooses inputs and any randomness in the system response. This means that practical estimation of these probabilities requires a model of user inputs and system outputs, perhaps derived from observed behaviour but in any case introducing a degree of risk that the model is inadequate and the calculated probability values inaccurate.

Our experiments indicate that the need to maintain a level of engagement and alertness with respect to individual online privacy is an unavoidable feature of online existence. A framework such as 3PS provides tools to help an engaged user to detect unwanted effects such as affinity in the proxy agent pool but the decision to engage and to take action is an unavoidably personal responsibility. As already discussed, personal judgements regarding risk seem intrinsic to discussions of privacy.

VIII-B Conclusions

Through 3PS we provide a user with the capability to achieve anonymity by adopting group identities. We provide a method to decide on the optimal choice of group identity from a pool of proxy identities. The method we develop is both efficient and scalable, and does not require a user to reveal information about their interest in sensitive topics.

We define a threat model based on notions of increasingly powerful observers with access to various levels of information about the 3PS system. Using the associated attack models we show that 3PS offers users a high degree of protection for their interests in sensitive topics.

Mass personal data collection is a persistent feature of online systems that has been justified as required for personalisation. Through the framework of the 3PS system our results suggest that much less personal data collection is required for adequate personalisation. This has significant implications for online providers in light of legislation such as GDPR that requires data to be limited to that which is proportionate to the purpose of collection.

Our experiments show that once diversity of likely interests is maintained across the proxy pool, 3PS provides high levels of protection for users while providing satisfactory personalisation. The 3PS framework provides readily implementable techniques to decide whether a particular choice of proxy agent is overly revealing of likely interest in a topic of their choice, so that lack of diversity in the proxy pool is detectable by users without requiring additional infrastructure such as trusted intermediate parties. The defence of injecting noise queries is observed to improve plausible deniability while maintaining levels of utility. Automation of noise injection, together with techniques such as automated suggestion of new keywords, means that 3PS can be implemented with relatively little intrusiveness on the user side through, for example, a browser plug-in.

The fast convergence and high accuracy of the proxy agent selection method observed in our experiments indicate that 3PS can provide a safe and scalable solution that requires little retro-fitting to work with existing systems.

Overall, our results indicate that 3PS is a promising first step and that more research is required in large-scale production environments. Directions for future research include undertaking a practical program of applied research to scale 3PS to a production implementation, investigating how non-text inputs and outputs such as images can be accommodated in 3PS, and incorporating 3PS into a larger framework of practical privacy tools to provide robust end-to-end protection for users.

References
