Abstract
The recent increase in online privacy concerns prompts the following question: can a recommender system be accurate if users do not entrust it with their private data? To answer this, we study the problem of learning item-clusters under local differential privacy, a powerful, formal notion of data privacy. We develop bounds on the sample-complexity of learning item-clusters from privatized user inputs. Significantly, our results identify a sample-complexity separation between learning in an information-rich and an information-scarce regime, thereby highlighting the interaction between privacy and the amount of information (ratings) available to each user.
In the information-rich regime, where each user rates at least a constant fraction of items, a spectral clustering approach is shown to achieve a sample-complexity lower bound derived from a simple information-theoretic argument based on Fano’s inequality. However, the information-scarce regime, where each user rates only a vanishing fraction of items, is found to require a fundamentally different approach, both for lower bounds and algorithms. To this end, we develop new techniques for bounding mutual information under a notion of channel mismatch, and also propose a new algorithm, MaxSense, which we show achieves optimal sample-complexity in this setting.
The techniques we develop for bounding mutual information may be of broader interest. To illustrate this, we show their applicability to learning based on 1-bit sketches, and to adaptive learning, where queries can be adapted based on answers to past queries.
The Price of Privacy in Untrusted Recommendation Engines
Stanford University
Nidhi Hegde (nidhi.hegde@technicolor.com), Technicolor, Paris Research Lab
Laurent Massoulié (laurent.massoulie@inria.fr), Microsoft Research-INRIA Joint Center
Keywords: Differential privacy, recommender systems, lower bounds, partial information
1 Introduction
Recommender systems are fast becoming one of the cornerstones of the Internet; in a world with ever-increasing choices, they are one of the most effective ways of matching users with items. Today, many websites use some form of such systems. Research in these algorithms received a fillip from the Netflix prize competition in 2009. Ironically, however, the contest also exposed the Achilles heel of such systems, when Narayanan and Shmatikov (2006) demonstrated that the Netflix data could be de-anonymized. Subsequent works (for example, Calandrino et al. (2011)) have reinforced belief in the frailty of these algorithms in the face of privacy attacks.
To design recommender systems in such scenarios, we first need to define what it means for a data-release mechanism to be private. The popular perception has coalesced around the notion that a person can either participate in a recommender system and waive all claims to privacy, or avoid such systems entirely. The response of the research community to these concerns has been the development of a third paradigm between complete exposure and complete silence. This approach has been captured in the formal notion of differential privacy (see Dwork (2006)); essentially it suggests that although perfect privacy is impossible, one can control the leakage of information by deliberately corrupting sensitive data before release. The original definition in Dwork (2006) provides a statistical test that must be satisfied by a data-release mechanism to be private. Accepting this paradigm shifts the focus to designing algorithms that obey this constraint while maximizing relevant notions of utility. This tradeoff between utility and privacy has been explored for several problems in database management (Blum et al. (2005); Dwork (2006); Dwork et al. (2006, 2010a, 2010b)) and learning (Blum et al. (2008); Chaudhuri et al. (2011); Gupta et al. (2011); Kasiviswanathan et al. (2008); McSherry and Mironov (2009); Smith (2011)).
In the context of recommender systems, there are two models for ensuring privacy: centralized and local. In the centralized model, the recommender system is trusted to collect data from users; it then responds to queries by publishing results that have been corrupted via some differentially private mechanism. However, users increasingly desire control over their private data, given their mistrust in centralized databases (which is supported by examples such as the Netflix privacy breach). In cases where the database cannot be trusted to keep data confidential, users can store their data locally, and differential privacy is ensured through suitable randomization at the ‘user-end’ before releasing data to the recommender system. This is precisely the context of the present paper: the design of differentially private algorithms for untrusted recommender systems.
The latter model is variously known in the privacy literature as local differential privacy (see Kasiviswanathan et al. (2008); we henceforth refer to it as local-DP), and in statistics as the ‘randomized response technique’ (see Warner (1965)). However, there are two unique challenges to local-DP posed by recommender systems which have not been satisfactorily dealt with before:

The underlying space (here, the set of ratings over all items) has very high dimensionality.

The users have limited information: they rate only a (vanishingly small) fraction of items.
In this work we address both these issues. We consider the problem of learning an unknown (low-dimensional) clustering for a large set of items from privatized user feedback. Surprisingly, we demonstrate a sharp change in the sample-complexity of local-DP learning algorithms when shifting from an information-rich to an information-scarce regime – no similar phenomenon is known for non-private learning. With the aid of new information-theoretic arguments, we provide lower bounds on the sample-complexity in various regimes. On the other hand, we also develop novel algorithms, particularly in the information-scarce setting, which match the lower bounds up to logarithmic factors. Thus, although we pay a ‘price of privacy’ when ensuring local-DP in untrusted recommender systems with information scarcity, we can design optimal algorithms for such regimes.
1.1 Our Results
We focus on learning a generative model for the data, under user-end, or local, differential privacy constraints. Local differential privacy ensures that user data is privatized before being made available to the recommender system – the aim of the system is thus to learn the model from privatized responses to (appropriately designed) queries. The metric of interest is the sample-complexity – the minimum number of users required for efficient learning.
Formally, given a set of items, we want to learn a partition or clustering of the item set, such that items within a cluster are statistically similar (in terms of user ratings). The class of models (or hypothesis class) we wish to learn is thus the set of mappings from items to clusters, where typically the number of clusters is much smaller than the number of items^{1}. The system can collect information from users, where each user has rated only a small subset of the items, and interacts with the system via a mechanism satisfying local-DP. To be deemed successful, we require that an algorithm identify the correct cluster label for all items^{2}.
^{1}Throughout the paper, we use [n] to denote the set {1, …, n}.
^{2}This is for ease of exposition – our results extend to allowing a fraction of item-misclassifications, c.f. Appendix A.
To put the above model in perspective, consider the problem of movie recommendation – here items are movies, and the recommender system wants to learn a clustering of these movies, wherein two movies in a cluster are ‘similar’. We assume that each user has watched only some of the movies, and is unwilling to share her ratings with the recommender system without appropriate privatization of the data. Once the recommender system has learnt a good clustering, it can make this knowledge public, allowing users to obtain their own recommendations based on their viewing history. This is similar in spirit to the ‘You Might Also Like’ feature on IMDB or Amazon.
Our starting point for sample-complexity bounds is the following basic lower bound (c.f. Section 2 for details):
Informal Theorem 1
(Theorem 7) For any (finite) hypothesis class to be ‘successfully’ learned under local-DP, the number of users must satisfy:
The above theorem is based on a standard use of Fano’s inequality in statistical learning. Similar connections between differential privacy and mutual information have been established before (c.f. Section 1.2) – we include it here as it helps put our main results in perspective.
Returning to the recommender system problem, note that for learning item-clusters, the hypothesis class is exponentially large in the number of items. We next consider an information-rich setting, wherein each user knows ratings for a constant fraction of the items. We show the above bound is matched (up to logarithmic factors) by a local-DP algorithm based on a novel ‘pairwise-preference’ sketch and spectral clustering techniques:
Informal Theorem 2
(Theorem 8) In the information-rich regime under local-DP, clustering via the Pairwise-Preference Algorithm succeeds if the number of users satisfies:
The above theorems thus provide a complete picture of the information-rich setting. In practical scenarios, however, the number of items each user rates is quite small; for example, in a movie ratings system, users usually have seen and rated only a vanishing fraction of movies. Our main results in the paper concern non-adaptive, local-DP learning in the information-scarce regime – wherein each user rates only a vanishing fraction of items. Herein, we observe an interesting phase change in the sample-complexity of private learning:
Informal Theorem 3
In the information-scarce regime under local-DP, the number of users required for non-adaptive cluster learning must satisfy the lower bound of Theorem 13.
Furthermore, when the number of items rated per user is sufficiently small, a stronger lower bound holds (Theorem 14).
To see why this result is surprising, consider the following toy problem: each item belongs to one of two clusters. Users arrive, sample a single item uniformly at random and learn its corresponding cluster, answer a query from the recommender system, and leave.
For non-private learning, if there is no constraint on the amount of information exchanged between the user and the algorithm, then the number of users needed for learning the clusters is of order n log n for n items (via a simple coupon-collector argument). Note that each user's data is a single (item index, cluster) pair. Now if we put a constraint that the average amount of information exchanged between a user and the algorithm is 1 bit, then intuition suggests that the recommender system needs correspondingly more users. This is achieved by the following simple strategy: each user reveals her complete information with some small probability, else reveals no information – clearly the amount of information exchanged per user can be made 1 bit on average, and a modified coupon-collector argument shows that this scheme suffices to learn the item clusters.
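The coupon-collector count behind the toy problem above is easy to check empirically. The following sketch (ours, with illustrative parameters not taken from the paper) counts how many users, each sampling one item uniformly at random, are needed before every item has been observed at least once:

```python
import random

def users_to_cover_all_items(n_items: int, rng: random.Random) -> int:
    """Count users (each sampling one item uniformly at random)
    until every item has been observed at least once."""
    seen = set()
    count = 0
    while len(seen) < n_items:
        seen.add(rng.randrange(n_items))
        count += 1
    return count

# Averaged over many trials, the count concentrates around n_items * ln(n_items),
# the classical coupon-collector scaling.
```

Averaging this count over many independent trials recovers the familiar n log n growth in the number of items.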
However, the situation changes if we impose the condition that the amount of information exchanged is exactly 1 bit per user (for example, the algorithm asks a yes/no question of the user); as a side-product of the techniques we develop for Theorem 14, we show that the number of users required in this case is substantially larger (c.f. Theorem 10). This fundamental change in sample-complexity scaling is due to the combination of users having limited information and a ‘per-user information’ constraint (as opposed to the average information constraint). One major takeaway of our work is that local differential privacy in the information-scarce regime has a similar effect.
Finally, for the information-scarce regime, we develop a new algorithm, MaxSense, which (under appropriate separation conditions) matches the above bound up to logarithmic factors:
Informal Theorem 4
Techniques: Our main technical contribution lies in the tools we use for the lower bounds in the information-scarce setting. By viewing the privacy mechanism as a noisy channel with appropriate constraints, we are able to use information-theoretic methods to obtain bounds on private learning. Although connections between privacy and mutual information have been considered before (see McGregor et al. (2010); Alvim et al. (2011)), existing techniques do not capture the change in sample-complexity in high-dimensional regimes. We formalize a new notion of ‘channel mismatch’ between the ‘sampling channel’ (the partial ratings known to the users) and the privatization channel. In Section 4 we provide a structural lemma (Lemma 9) that quantifies this mismatch under general conditions, and demonstrate its use by obtaining tight lower bounds under 1-bit (non-private) sketches. In Section 4.3 we use it to obtain tight lower bounds under local-DP. In Section 6 we discuss its application to adaptive local-DP algorithms, establishing a lower bound that again refines the bound in Theorem 7. Though we focus on the item clustering problem, our lower bounds apply to learning any finite hypothesis class under privacy constraints.
The information-theoretic results also suggest that 1-bit privatized sketches are sufficient for learning in such scenarios. Based on this intuition, we show how existing spectral-clustering techniques can be extended to private learning in some regimes. More significantly, in the information-scarce regime, where spectral learning fails, we develop a novel algorithm based on blind probing of a large set of items. This algorithm, in addition to being private and having optimal sample-complexity in many regimes, suggests several interesting open questions, which we discuss in Section 7.
1.2 Related Work
Privacy-preserving recommender systems: The design of recommender systems with differential privacy was studied by McSherry and Mironov (2009) under the centralized model. Like us, they separate the recommender system into two components, a learning phase (based on a database appropriately perturbed to ensure privacy) and a recommendation phase (performed by the users ‘at home’, without interacting with the system). They numerically compare the performance of the algorithm against non-private algorithms. In contrast, we consider a stronger notion of privacy (local-DP), and for our generative model, are able to provide tight analytical guarantees and, further, quantify the impact of limited information on privacy.
Private PAC Learning and Query Release: Several works have considered private algorithms for PAC-learning. Blum et al. (2008); Gupta et al. (2011) consider the private query release problem (i.e., releasing approximate values for all queries in a given class) in the centralized model. Kasiviswanathan et al. (2008) show equivalences between: a) centralized private learning and agnostic PAC learning, b) local-DP and the statistical query (SQ) model of learning; this line of work is further extended by Beimel et al. (2010). Although some of our results (in particular, Theorem 7) are similar in spirit to lower bounds for PAC learning (see Kasiviswanathan et al. (2008); Beimel et al. (2010)), there are significant differences both in scope and technique. Furthermore:

We emphasize the importance of limited information, and characterize its impact on learning with local-DP. Hitherto unconsidered, information scarcity is prevalent in practical scenarios, and as our results show, it has strong implications for learning performance under local-DP.
Privacy in Statistical Learning: A large body of recent work has looked at the impact of differential privacy on statistical learning techniques. A majority of this work focuses on centralized differential privacy. For example, Chaudhuri et al. (2011) consider privacy in the context of empirical risk minimization; they analyze the release of classifiers, obtained via algorithms such as SVMs, with (centralized) privacy constraints on the training data. Dwork and Lei (2009) study algorithms for privacy-preserving regression under the centralized model; these however require running time which is exponential in the data dimension. Smith (2011) obtains private, asymptotically optimal algorithms for statistical estimation, again, though, in the centralized model.
More recently, Duchi et al. (2013) consider the problem of finding minimax rates for statistical estimators under local-DP. Their techniques are based on refined analysis of information-theoretic quantities, including generalizations of the Fano's inequality bounds we use in Section 3.1. However, the estimation problems they consider have a simpler structure – in particular, they involve learning from samples generated directly from an underlying model (albeit privatized). What makes our setting challenging is the combination of a generative model (the bipartite Stochastic Blockmodel) with incomplete information (due to user-item sampling) – it seems unlikely that the techniques of Duchi et al. (2013) can extend easily to our setting. Moreover, lower bound techniques do not naturally yield good algorithms.
Other Notions of Privacy: The local-DP model which we consider has been studied before in the privacy literature (Kasiviswanathan et al. (2008); Dwork et al. (2006)) and statistics (Warner (1965)). It is a stronger notion than central differential privacy, and also stronger than two other related notions: pan-privacy (Dwork et al. (2010b)), where the database has to also deal with occasional release of its state, and privacy under continual observations (Dwork et al. (2010a)), where the database must deal with additions and deletions, while maintaining privacy.
Recommendation algorithms based on incoherence: Apart from privacy-preserving algorithms, there is a large body of work on designing recommender systems under various constraints (usually low-rank) on the ratings matrix (for example, Wainwright (2009); Keshavan et al. (2010)). These methods, though robust, fail in the presence of privacy constraints, as the noise added as a result of privatization is much more than their noise tolerance. This is intuitive, as successful matrix completion would constitute a breach of privacy; our work builds the case for using simpler, lower-dimensional representations of the data, and simpler algorithms based on extracting limited information (in our case, 1-bit sketches) from each user.
2 Preliminaries
We now present our system model, formally define different notions of differential privacy, and introduce some tools from information theory that form the basis of our proofs.
2.1 The Bipartite Stochastic Blockmodel
Recommender systems typically assume the existence of an underlying low-dimensional generative model for the data – the aim then is to learn the parameters of this model, and then use the learned model to infer unknown user-item ratings. In this paper we consider a model wherein items and users belong to underlying clusters, and a user’s rating for an item depends only on the clusters they belong to. This is essentially a bipartite version of the Stochastic Blockmodel (Holland et al. (1983)), widely used in the model selection literature. The aim of the recommendation algorithm is to learn these clusters, and then reveal them to the users, who can then compute their own recommendations privately. Our model, though simpler than the state of the art in recommender systems, is still rich enough to account for many of the features seen empirically in recommender systems. In addition, it yields reasonable accuracy in non-private settings on meaningful datasets (c.f. Tomozei and Massoulié (2011)).
Formally, we have a set of users and a set of items. The set of users is divided into clusters, and similarly, the set of items is divided into clusters. We use an (incomplete) matrix to denote the user/item ratings, where each row corresponds to a user, and each column to an item. For simplicity, we assume binary ratings; for example, this could correspond to ‘like/dislike’ ratings. Finally, we have the following statistical assumption for the ratings: for a given user and item, the rating is a Bernoulli random variable whose parameter depends only on the user's cluster and the item's cluster. Ratings for different user-item pairs are assumed independent.
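As a concrete illustration, ratings under this bipartite blockmodel can be sampled as follows. This is a minimal sketch of ours; the cluster assignments and the matrix of Bernoulli parameters are hypothetical inputs, not quantities fixed by the paper:

```python
import random

def sample_ratings(user_cluster, item_cluster, P, seed=0):
    """Sample a full binary ratings matrix from the bipartite blockmodel:
    the rating of user u for item i is Bernoulli(P[c_u][c_i]), where c_u
    and c_i are the clusters of u and i; entries are drawn independently."""
    rng = random.Random(seed)
    return [[1 if rng.random() < P[cu][ci] else 0 for ci in item_cluster]
            for cu in user_cluster]
```

For example, `sample_ratings([0, 1], [0, 0, 1], [[0.9, 0.1], [0.1, 0.9]])` produces a 2x3 binary matrix in which users tend to ‘like’ items from their associated item-cluster.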
In order to model limited information, i.e., the fact that users rate only a fraction of all items, we define a parameter giving the number of items each user has rated. More generally, we only need to know this parameter in an order-wise sense. We assume that the rated items are picked uniformly at random. We refer to the case where each user rates a constant fraction of the items as the information-rich regime, and the case where each user rates a vanishing fraction as the information-scarce regime.
Given this model, the aim of the recommender system is to learn the item-clusters from user-item ratings. Note that the difficulty in doing so is twofold:

The user-item ratings matrix is incomplete – in particular, each user has ratings for only a small subset of the items.

Users share their information only via a privacypreserving mechanism (as we discuss in the next section).
Our work exposes how these two factors interact to affect the sample-complexity, i.e., the minimum number of users required to learn the item-clusters. We note also that another difficulty in learning is that the user-item ratings are noisy – however, as long as this noise does not depend on the number of items, it does not affect the sample-complexity scaling.
2.2 Differential Privacy
Differential privacy is a framework that defines conditions under which an algorithm can be said to be privacy preserving with respect to the input. Formally (following Dwork (2006)):
Definition 1
(Differential Privacy) A randomized function mapping private data to an output is said to be differentially private if, for all values in the range space of the function, and for all ‘neighboring’ data sets, we have:
(1) 
We assume that, conditioned on the input, the output is independent of any external side information (in other words, the output of the mechanism depends only on the input and its internal randomness). The definition of ‘neighboring’ is chosen according to the situation, and determines the data that remain private. In the original definition Dwork (2006), two databases are said to be neighbors if the larger database is constructed by adding a single tuple to the smaller database. In the context of ratings matrices, two matrices can be neighbors if they differ in: a single row (per-user privacy), or a single rating (per-rating privacy).
Two crucial properties of differential privacy are composition and post-processing. We state these here without proof; c.f. Dwork (2006) for details. Composition captures the reduction in privacy due to sequentially applying multiple differentially-private release mechanisms:
Proposition 2
(Composition) If several outputs are obtained from the same data by different randomized functions, each of which is differentially private, then the resultant composite function is also differentially private (with a correspondingly weaker privacy parameter).
Post-processing states that processing the output of a differentially private release mechanism can only make it more differentially private (i.e., with a privacy parameter no larger than the original) vis-à-vis the input:
Proposition 3
(Post-processing) If a function is differentially private, then composing it with any further data-independent function yields a mechanism that is also differentially private.
In settings where the database curator is untrusted, an appropriate notion of privacy is local differential privacy (or local-DP). For each user, let her private data be – in the recommendation context – the rated-item labels and corresponding ratings, and let the released data be what the user makes publicly available to the untrusted curator. Local-DP requires that the released data be differentially private w.r.t. the private data. This paradigm is similar to the Randomized Response technique in statistics (Warner (1965)). It is the natural notion of privacy in the case of untrusted databases, as the data is privatized at the user-end before storage in the database; to emphasize this, we alternately refer to it as User-end Differential Privacy.
We conclude this section with a mechanism for releasing a single bit under differential privacy. Differential privacy for this mechanism is easy to verify using Equation (1).
Proposition 4
(DP bit release): Given an input bit, set the output equal to the input with a fixed probability, and equal to its complement otherwise. Then the output is differentially private w.r.t. the input.
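The retention probability in Proposition 4 is not reproduced here; the standard choice, which makes the mechanism ε-differentially private, keeps the bit with probability e^ε/(1+e^ε). A minimal Python sketch (ours) under that assumption, together with a check of the worst-case likelihood ratio:

```python
import math
import random

def dp_bit_release(bit: int, eps: float, rng: random.Random) -> int:
    """Randomized response: output the input bit with probability
    e^eps / (1 + e^eps), else output its complement."""
    p_keep = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < p_keep else 1 - bit

def worst_case_ratio(eps: float) -> float:
    """max over outputs y of P(y | bit = 1) / P(y | bit = 0); eps-DP requires
    this to be at most e^eps, and here it equals e^eps exactly."""
    p_keep = math.exp(eps) / (1.0 + math.exp(eps))
    return p_keep / (1.0 - p_keep)
```

With this choice the likelihood ratio between neighboring inputs is exactly e^ε for either output value, so Equation (1) holds with equality.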
2.3 Preliminaries from Information Theory
For a random variable taking values in some discrete space, its entropy is defined in the usual way^{3}. For two random variables, the mutual information between them is the reduction in entropy of one given knowledge of the other.
^{3}For notational convenience, we use logarithms to base 2 throughout; hence, the entropy is in ‘bits’.
Our main tools for constructing lower bounds are variants of Fano’s inequality, which are commonly used in the nonparametric statistics literature (c.f. Santhanam and Wainwright (2009); Wainwright (2009)). Consider a finite hypothesis class. Suppose that we choose a hypothesis uniformly at random from this class, sample a data set drawn in an i.i.d. manner according to a distribution (in our case, each sample corresponds to a user, with ratings drawn according to the statistical model in Section 2.1), and then provide a private version of this data to the learning algorithm. We can represent this as the Markov chain:
Further, we define a given learning algorithm to be unreliable for the hypothesis class if, for a hypothesis drawn uniformly at random, its probability of error does not vanish.
Fano’s inequality provides a lower bound on the probability of error under any learning algorithm in terms of the mutual information between the underlying hypotheses and the samples. A basic version of the inequality is as follows:
Lemma 5
(Fano’s Inequality) Given a hypothesis drawn uniformly from the hypothesis class, and samples drawn according to the model, for any learning algorithm, the average probability of error satisfies:
(2) 
As a direct consequence of this result, if the mutual information between the hypothesis and the samples is small compared to the log-size of the hypothesis class, then any algorithm fails to correctly identify almost all of the possible underlying models. Though this is a weak bound, Equation (2) turns out to be sufficient to study sample-complexity scaling in the cases we consider. In Appendix A, we consider stronger versions of the above lemma, as well as more general criteria for approximate model selection (e.g., allowing for distortion).
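For reference, the textbook form of Fano's inequality that Lemma 5 instantiates, for a hypothesis $H$ drawn uniformly from a finite class $\mathcal{H}$ and $N$ (privatized) samples $Y^N$, reads:

```latex
P_{\mathrm{err}} \;\geq\; 1 - \frac{I(H;\, Y^N) + 1}{\log_2 |\mathcal{H}|}.
```

In particular, if $I(H; Y^N) = o(\log_2 |\mathcal{H}|)$, the error probability tends to one, which is the sense in which an algorithm is ‘unreliable’.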
3 Item-Clustering under Local-DP: The Information-Rich Regime
In this section, we derive a basic lower bound on the number of users needed for accurate learning under local differential privacy. This relies on a simple bound on the mutual information between any database and its privatized output, and hence is applicable in general settings. Returning to item-clustering, we give an algorithm that matches the optimal scaling (up to logarithmic factors) under one of the following two conditions: each user has rated a constant fraction of items (the information-rich regime), or only the ratings are private, not the identities of the rated items.
3.1 Differential Privacy and Mutual Information
We first present a lemma that characterizes the mutual information leakage across any differentially private channel:
Lemma 6
Given a (private) r.v., a privatized output obtained by any local-DP mechanism, and any side information, we have:
Lemma 6 follows directly from the definitions of mutual information and differential privacy (note that for any such mechanism, the output given the input is conditionally independent of any side-information). We note that similar results have appeared before in the literature; for example, equivalent statements appear in McGregor et al. (2010); Alvim et al. (2011). We present the proof here for the sake of completeness:
Proof [Proof of Lemma 6]
Here the inequality is a direct application of the definition of differential privacy (Equation (1)), and in particular, the fact that it holds for any side information.
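The flavor of Lemma 6 can be sanity-checked numerically on the binary randomized-response channel: since an ε-DP mechanism satisfies P(y|x)/P(y) ≤ e^ε for every output y, any such channel leaks at most ε·log₂e bits per sample. The check below (our code; the channel parametrization is the standard randomized response, not taken from the paper) computes the exact mutual information for a uniform input bit:

```python
import math

def mi_randomized_response(eps: float) -> float:
    """Mutual information I(X;Y) in bits across the binary randomized-response
    channel (keep the bit w.p. e^eps / (1 + e^eps)) with a uniform input bit.
    By symmetry H(Y) = 1, so I(X;Y) = 1 - H(p_keep)."""
    p = math.exp(eps) / (1.0 + math.exp(eps))
    h = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return 1.0 - h
```

For every ε the value stays below ε·log₂e, and it shrinks roughly quadratically as ε → 0, illustrating how little each privatized sample can reveal.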
Returning to the private learning of item classes, we obtain a lower bound on the sample-complexity by considering the following special case of the item-clustering problem: let the hypothesis be a mapping of the item set to two classes – hence the hypothesis class is of size exponential in the number of items. Each user has some private data, generated via the bipartite Stochastic Blockmodel (c.f. Section 2.1). Recall that we define a learning algorithm to be unreliable if its probability of error does not vanish. Using Lemma 6 and Fano’s inequality (Lemma 5), we get the following lower bound on the sample-complexity:
Theorem 7
Suppose the underlying clustering is drawn uniformly at random from the hypothesis class. Then any learning algorithm obeying local-DP is unreliable if the number of queries satisfies:
Proof We now have the following information-flow model for each user (under local-DP):
Here sampling refers to each user rating a subset of items. Now by using the Data-Processing Inequality (Cover and Thomas (2006)), followed by Lemma 6, we have that:
Fano’s inequality (Lemma 5) then implies that a learning algorithm is unreliable if the number of queries satisfies: .
We note here that the above theorem, though stated for the bipartite Stochastic Blockmodel, in fact gives sample-complexity bounds for more general model-selection problems. Further, in Appendix A, we extend the result to allow for distortion – wherein the algorithm is allowed to make a mistake on some fraction of item-labels.
For the bipartite Stochastic Blockmodel, though the above bound is not the tightest, it turns out to be achievable (up to log factors) in the information-rich regime, as we show next. We note that a similar bound was given by Beimel et al. (2010) for PAC-learning under centralized DP, using more explicit counting techniques. Both our results and the bounds in Beimel et al. (2010) fail to exhibit the correct scaling in the information-scarce setting. However, unlike proofs based on counting arguments, our method allows us to leverage more sophisticated information-theoretic tools for other variants of the problem, like those we consider subsequently in Section 4.
3.2 Item-Clustering in the Information-Rich Regime
To conclude this section, we outline an algorithm for clustering in the information-rich regime. The algorithm proceeds as follows: (i) the recommendation algorithm provides each user with two items picked at random; (ii) the user computes a private sketch which equals 1 if she rated both items positively, and 0 otherwise; (iii) each user releases a privatized version of her private sketch using the DP bit release mechanism; (iv) the algorithm constructs a matrix in which each entry is obtained by adding the sketches from all users queried with the corresponding item-pair; and (v) it finally performs spectral clustering of items based on this matrix. This algorithm, which we refer to as the Pairwise-Preference algorithm, is formally specified in Figure 1.
Setting: a set of items and a set of users. Each user has a set of ratings; each item is associated with a cluster.
Return: Cluster labels
Stage 1 (User sketch generation):

For each user , pick items :

At random if

If the set of rated items is known, pick two random rated items.


User generates a private sketch given by:
where the indicator equals 1 if the user rated both items positively, and 0 otherwise.
Stage 2 (User sketch privatization):
Each user releases privatized sketch from using the DP bit release mechanism (Proposition 4).
Stage 3 (Spectral Clustering):

Generate a pairwise-preference matrix, where:

Extract the top normalized eigenvectors (corresponding to the largest eigenvalues of the pairwise-preference matrix).

Project each row of the matrix into the low-dimensional profile space spanned by the top eigenvectors.

Perform k-means clustering in the profile space to obtain the item clusters.
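Stage 3 above can be sketched in a few lines. The following NumPy snippet is our illustrative implementation, not the paper's code; it uses a simple farthest-first k-means initialization of our choosing, projects items onto the top-k eigenvectors of a symmetric pairwise-preference matrix, and clusters the resulting profiles:

```python
import numpy as np

def spectral_item_clusters(A: np.ndarray, k: int, n_iter: int = 50) -> np.ndarray:
    """Project items onto the k largest-magnitude eigenvectors of the
    symmetric pairwise-preference matrix A, then run a simple k-means
    on the per-item profiles. Returns one cluster label per item."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:k]
    profiles = vecs[:, top]                  # row i = profile of item i
    # Farthest-first initialization, then plain Lloyd iterations.
    centers = [profiles[0]]
    for _ in range(1, k):
        d = np.min([((profiles - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(profiles[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((profiles[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = profiles[labels == j].mean(axis=0)
    return labels
```

On a matrix with clear block structure this recovers the blocks; the non-degeneracy conditions of Theorem 8 play the same role here as requiring well-separated top eigenvalues.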
Recall that in the bipartite Stochastic Blockmodel, we assume that the users belong to clusters of equal size. We now have the following theorem characterizing the performance of the Pairwise-Preference algorithm.
Theorem 8
The Pairwise-Preference algorithm satisfies local-DP. Further, suppose the eigenvalues and eigenvectors of the model satisfy the following non-degeneracy conditions:

The largest-magnitude eigenvalues have distinct absolute values.

The corresponding eigenvectors, suitably normalized, satisfy:
where .
Then, in the information-rich regime (i.e., when each user rates a constant fraction of the items), there exists a constant such that the item clustering is successful with high probability if the number of users satisfies:
Proof [Proof Outline]
Local differential privacy under the Pairwise-Preference algorithm is guaranteed by the use of DP bit release and the composition property. The performance analysis is based on a result on spectral clustering by Tomozei and Massoulié (2011). The main idea is to interpret the pairwise-preference matrix as representing the edges of a random graph over the item set, with the presence of an edge between a pair of items determined by the corresponding matrix entry. In particular, from the definition of the Pairwise-Preference algorithm, we can compute the probability of such an edge. This puts us in the setting analyzed by Tomozei and Massoulié (2011) – we can now use their spectral clustering bounds to get the result. For the complete proof, see Appendix B.
4 Local DP in the Information-Scarce Regime: Lower Bounds
As in the previous lower bound, we consider a simplified version of the problem in which there is a single class of users, and each item is rated deterministically (one of the two possible ratings) by each user. Let the underlying clustering function be given; in general, we can think of it as a bit vector indexed by the items. We assume that the user data for each user consists of a pair: a subset of items rated by that user, together with the corresponding ratings. The rated subset is assumed to be chosen uniformly at random from among all subsets of the appropriate size. We also denote the privatized sketch released by each user; here, the space to which sketches belong is an arbitrary finite or countably infinite space, and the sketch is assumed differentially private. Finally, as before, we assume that the clustering function is chosen uniformly at random. Thus we have the following information-flow model for each user:
Now, to get tighter lower bounds on the number of users needed for accurate item clustering, we need more accurate bounds on the mutual information between the underlying item-clustering model and the data available to the algorithm. The main idea behind our lower-bound techniques is to view the above chain as a composition of two channels – the first in which the user data is generated (sampled) by the underlying statistical model, and the second in which the algorithm receives a sketch of the user’s data. We then develop a new information inequality that bounds the mutual information in terms of the mismatch between the two channels. This technique turns out to be useful in settings without privacy as well – in Section 4.2, we show how it can be used to obtain sample-complexity bounds for learning with 1-bit sketches.
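To make the quantity being bounded concrete, the following toy computation (our own illustration, with hypothetical parameters: n = 3 items and a single observed coordinate per user) enumerates the chain exactly: the clustering vector x is uniform over bit vectors, the user observes one uniformly chosen coordinate, and releases it through an eps-DP randomized-response bit.

```python
import itertools
import math

import numpy as np

def sketch_mutual_info(n=3, eps=1.0):
    """Exact I(x; Y) for a toy instance of the chain x -> (S, x_S) -> Y:
    x is uniform on {0,1}^n, the user sees one uniform coordinate S, and
    Y is an eps-DP randomized-response release of the observed bit x_S.
    All distributions are enumerated, so the value is exact, not sampled."""
    p = math.exp(eps) / (1.0 + math.exp(eps))   # probability of a truthful report
    xs = list(itertools.product([0, 1], repeat=n))
    # P(Y = 1 | x): average over the hidden coordinate S, then apply bit noise.
    py1 = np.array([sum(p if x[s] else 1 - p for s in range(n)) / n for x in xs])
    joint = np.stack([1 - py1, py1], axis=1) / len(xs)   # P(x, Y)
    py = joint.sum(axis=0)
    px = 1.0 / len(xs)
    return sum(joint[i, y] * math.log(joint[i, y] / (px * py[y]))
               for i in range(len(xs)) for y in (0, 1) if joint[i, y] > 0)
```

Running this for increasing eps shows the per-sketch mutual information growing from zero, which is exactly the quantity the channel-mismatch inequality of Section 4.1 controls.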
4.1 Mutual Information under Channel Mismatch
We now establish a bound on the mutual information between a statistical model and a low-dimensional sketch; this is the main tool we use to obtain sample-complexity lower bounds. We define the collection of all fixed-size subsets of the item set, and the set from which the user information is drawn. Finally, a subscript on the expectation operator indicates the random variable over which the expectation is taken.
Lemma 9
Given the Markov chain above, let two pairs of ‘user-data’ sets be drawn independently and identically according to the conditional distribution of the pair given the sketch. Then the mutual information satisfies:
where the consistency notation indicates that the two user-data sets agree on the index set on which they overlap, i.e.,
Proof For brevity, we use the following shorthand notation. We then have:
(3) 
Similarly to the above, we use shorthand notation in which random variables are distinguished from their corresponding realizations. Now we have:
(Summing over )  
(By the Markov property)  
(Since )  
(4) 
where the last equality uses the fact that the type of each item is independent and uniformly distributed. Next, using a similar set of steps, we have:
where
(5) 
Finally, we combine equations (3),(4) and (5) together to get the result:
We note here that the above lemma is a special case (where the base measure is uniform) of a more general lemma, which we state and prove in Appendix C.
4.2 Sample-Complexity for Learning with 1-bit Sketches
To demonstrate the use of Lemma 9, we first consider a related problem that highlights the effect of per-user constraints (as opposed to average constraints) on the mutual information. We consider the same item-class learning problem as before, with each user having access to a single rating, but instead of a privacy constraint we impose a ‘per-user bandwidth’ constraint: each user can communicate only a single bit to the learning algorithm.
Theorem 10
Suppose each user observes a single (item, rating) pair, with items drawn i.i.d. uniformly at random. Then, for any 1-bit sketch derived from the user data, it holds that: and consequently, there exists a constant such that any cluster-learning algorithm using queries with 1-bit responses is unreliable if the number of users satisfies
Proof In order to use Lemma 9, we first note that the mutual information is a convex function of the conditional distribution of the sketch given the data, for a fixed data distribution (Cover and Thomas (2006)). The extremal points of this kernel correspond to deterministic sketches, where the mutual information is maximized. This implies that the class of deterministic queries with 1-bit responses that maximizes the mutual information has the following structure: given the user data, the algorithm provides the user with an arbitrary set of (item, rating) pairs, and the user indicates whether her data is contained in that set. Formally, the query is the indicator of membership in this set.
Defining , for a query response , we have the following:
and similarly for the complement of the set. From Lemma 9, applied to the corresponding random variables, we have:
Introducing the notation , the following identity is easily established:
(6) 
The RHS of (6) is a nonnegative-definite quadratic form in these variables. Thus:
where if and if . Now for a given , consider the partitioning of the set into , where for , . We then have the following:
The result now follows from Fano’s inequality (Lemma 5).
Note that the above bound is tight – to see this, consider an (adaptive) scheme in which each user is asked a random query of the form “Is your (item, rating) pair equal to this guess?”. The average number of users between two successful queries scales with the number of items, and one needs one successful query per bit to learn all the bits. This demonstrates an interesting change in the sample complexity of learning under per-user communication constraints (1-bit sketches in this section, privacy in the next) versus average-user constraints (a mutual information bound, or average bandwidth).
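The quadratic cost of this adaptive scheme can be checked by simulation; the parameter choices below (n = 20 items, one rating per user, 200 trials) and the helper name are illustrative choices of ours.

```python
import numpy as np

def users_to_learn(x, rng):
    """Count users consumed by the adaptive 1-bit query scheme until every
    bit of x is known.  Each arriving user holds a single random
    (index, rating) pair; the algorithm asks "is your pair (j, b)?" for a
    still-unknown index j and a uniform guess b, and learns x_j = b
    exactly when the user answers yes."""
    unknown = set(range(len(x)))
    users = 0
    while unknown:
        users += 1
        i = rng.integers(len(x))        # the user's single rated item
        j = min(unknown)                # adaptively target an unknown bit
        b = rng.integers(2)             # the algorithm's guess for x_j
        if i == j and x[i] == b:        # a 'yes' answer pins down x_j
            unknown.remove(j)
    return users

rng = np.random.default_rng(7)
n = 20
avg = np.mean([users_to_learn(rng.integers(2, size=n), rng)
               for _ in range(200)])
# each success takes 2n users in expectation, so avg is close to 2 * n * n
```

Each unknown bit is resolved by a geometric number of users with success probability 1/(2n), so the total is concentrated around 2n^2, matching the quadratic lower bound.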
4.3 Sample-Complexity for Learning under Local DP
We now exploit the above techniques to obtain lower bounds on the scaling required for accurate clustering under local DP in the information-scarce regime. To do so, we first need a technical lemma that relates the distribution of a random variable with and without conditioning on a differentially private sketch:
Lemma 11
Given a discrete random variable and a differentially private ‘sketch’ variable generated from it, there exists a function such that for any realization of the variable and any value of the sketch:
(7) 
Proof
Thus, we can define:
Further, from the definition of DP, we have:
and hence we have .
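As a numerical sanity check of the lemma (our own illustration, instantiating the private sketch as randomized response applied to a uniform bit), the posterior-to-prior ratios indeed stay within a factor of e^eps in both directions:

```python
import math

def posterior_prior_ratios(eps):
    """Extreme values of P(X = a | Y = y) / P(X = a) when X is a uniform
    bit and Y is its eps-DP randomized-response release (truthful with
    probability e^eps / (1 + e^eps)).  Lemma 11 says these ratios lie
    within [e^-eps, e^eps]."""
    p = math.exp(eps) / (1.0 + math.exp(eps))
    ratios = []
    for y in (0, 1):
        py = 0.5                                  # Y is uniform by symmetry
        for a in (0, 1):
            likelihood = p if a == y else 1 - p   # P(Y = y | X = a)
            posterior = likelihood * 0.5 / py     # Bayes' rule
            ratios.append(posterior / 0.5)        # prior P(X = a) = 1/2
    return min(ratios), max(ratios)
```

Here the extreme ratios are 2/(1 + e^eps) and 2e^eps/(1 + e^eps), which are strictly inside the interval guaranteed by the lemma for every eps > 0.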
Recall that we defined the set from which the user information is drawn. We write a base probability distribution under which the rated set and the ratings are i.i.d. uniform, and denote mathematical expectation under this distribution accordingly. We also need the following estimate (cf. Appendix C for the proof):
Lemma 12
If , then:
We can now prove our tightened bounds. We first obtain a weak lower bound in Theorem 13, valid for all parameter values, and then refine it in Theorem 14 under additional conditions.
Theorem 13
In the information-scarce regime, under local DP we have:
and consequently, there exists a constant such that any cluster-learning algorithm with local DP is unreliable if the number of users satisfies
Proof To bound the mutual information between the underlying model and each private sketch, we use Lemma 9. In particular, we show that the mutual information is suitably bounded for any given value of the private sketch.