The recent increase in online privacy concerns prompts the following question: can a recommender system be accurate if users do not entrust it with their private data? To answer this, we study the problem of learning item-clusters under local differential privacy, a powerful, formal notion of data privacy. We develop bounds on the sample-complexity of learning item-clusters from privatized user inputs. Significantly, our results identify a sample-complexity separation between learning in an information-rich and an information-scarce regime, thereby highlighting the interaction between privacy and the amount of information (ratings) available to each user.
In the information-rich regime, where each user rates at least a constant fraction of items, a spectral clustering approach is shown to achieve a sample-complexity lower bound derived from a simple information-theoretic argument based on Fano’s inequality. However, the information-scarce regime, where each user rates only a vanishing fraction of items, is found to require a fundamentally different approach both for lower bounds and algorithms. To this end, we develop new techniques for bounding mutual information under a notion of channel-mismatch, and also propose a new algorithm, MaxSense, and show that it achieves optimal sample-complexity in this setting.
The techniques we develop for bounding mutual information may be of broader interest. To illustrate this, we show their applicability to learning based on 1-bit sketches, and adaptive learning, where queries can be adapted based on answers to past queries.
The Price of Privacy in Untrusted Recommendation Engines
Keywords: Differential privacy, recommender systems, lower bounds, partial information
Recommender systems are fast becoming one of the cornerstones of the Internet; in a world with ever-increasing choices, they are one of the most effective ways of matching users with items. Today, many websites use some form of such systems. Research on these algorithms received a fillip from the Netflix Prize competition in 2009. Ironically, however, the contest also exposed the Achilles heel of such systems, when Narayanan and Shmatikov (2006) demonstrated that the Netflix data could be de-anonymized. Subsequent works (for example, Calandrino et al. (2011)) have reinforced the belief in the frailty of these algorithms in the face of privacy attacks.
To design recommender systems in such scenarios, we first need to define what it means for a data-release mechanism to be private. The popular perception has coalesced around the notion that a person can either participate in a recommender system and waive all claims to privacy, or avoid such systems entirely. The response of the research community to these concerns has been the development of a third paradigm between complete exposure and complete silence. This approach has been captured in the formal notion of differential privacy (see Dwork (2006)); essentially it suggests that although perfect privacy is impossible, one can control the leakage of information by deliberately corrupting sensitive data before release. The original definition in Dwork (2006) provides a statistical test that must be satisfied by a data-release mechanism to be private. Accepting this paradigm shifts the focus to designing algorithms that obey this constraint while maximizing relevant notions of utility. This trade-off between utility and privacy has been explored for several problems in database management Blum et al. (2005); Dwork (2006); Dwork et al. (2006, 2010a, 2010b) and learning Blum et al. (2008); Chaudhuri et al. (2011); Gupta et al. (2011); Kasiviswanathan et al. (2008); McSherry and Mironov (2009); Smith (2011).
In the context of recommender systems, there are two models for ensuring privacy: centralized and local. In the centralized model, the recommender system is trusted to collect data from users; it then responds to queries by publishing results that have been corrupted via some differentially private mechanism. However, users increasingly desire control over their private data, given their mistrust in centralized databases (which is supported by examples such as the Netflix privacy breach). In cases where the database cannot be trusted to keep data confidential, users can store their data locally, and differential privacy is ensured through suitable randomization at the ‘user-end’ before releasing data to the recommender system. This is precisely the context of the present paper: the design of differentially private algorithms for untrusted recommender systems.
The latter model is variously known in the privacy literature as local differential privacy (see Kasiviswanathan et al. (2008); we henceforth refer to it as local-DP), and in statistics as the ‘randomized response technique’ (see Warner (1965)). However, there are two unique challenges to local-DP posed by recommender systems which have not been satisfactorily dealt with before:
The underlying space (here, the set of ratings over all items) has very high dimensionality.
The users have limited information: they rate only a (vanishingly small) fraction of items.
In this work we address both these issues. We consider the problem of learning an unknown (low-dimensional) clustering for a large set of items from privatized user-feedback. Surprisingly, we demonstrate a sharp change in the sample-complexity of local-DP learning algorithms when shifting from an information-rich to an information-scarce regime – no similar phenomenon is known for non-private learning. With the aid of new information-theoretic arguments, we provide lower bounds on the sample-complexity in various regimes. On the other hand, we also develop novel algorithms, particularly in the information-scarce setting, which match the lower bounds up to logarithmic factors. Thus although we pay a ‘price of privacy’ when ensuring local-DP in untrusted recommender systems with information-scarcity, we can design optimal algorithms for such regimes.
1.1 Our Results
We focus on learning a generative model for the data, under user-end, or local differential privacy constraints. Local differential privacy ensures that user data is privatized before being made available to the recommender system – the aim of the system is thus to learn the model from privatized responses to (appropriately designed) queries. The metric of interest is the sample-complexity – the minimum number of users required for efficient learning.
To put the above model in perspective, consider the problem of movie-recommendation – here items are movies, and the recommender system wants to learn a clustering of these movies, wherein two movies in a cluster are ‘similar’. We assume that each user has watched movies, but is unwilling to share these ratings with the recommender system without appropriate privatization of their data. Once the recommender system has learnt a good clustering, it can make this knowledge public, allowing users to obtain their own recommendations, based on their viewing history. This is similar in spirit to the ‘You Might Also Like’ feature on IMDB or Amazon.
Our starting point for sample-complexity bounds is the following basic lower bound (c.f. Section 2 for details):
Informal Theorem 1
(Theorem 7) For any (finite) hypothesis class to be ‘successfully’ learned under -local-DP, the number of users must satisfy:
The above theorem is based on a standard use of Fano’s inequality in statistical learning. Similar connections between differential privacy and mutual information have been established before (c.f. Section 1.2) – we include it here as it helps put our main results in perspective.
Returning to the recommender system problem, note that for the problem of learning item-clusters, . We next consider an information-rich setting, wherein , i.e., each user knows ratings for a constant fraction of the items. We show the above bound is matched (up to logarithmic factors) by a local-DP algorithm based on a novel ‘pairwise-preference’ sketch and spectral clustering techniques:
Informal Theorem 2
(Theorem 8) In the information-rich regime under -local-DP, clustering via the Pairwise-Preference Algorithm succeeds if the number of users satisfies:
The above theorems thus provide a complete picture of the information-rich setting. In practical scenarios, however, is quite small; for example, in a movie ratings system, users usually have seen and rated only a vanishing fraction of movies. Our main results in the paper concern non-adaptive, local-DP learning in the information-scarce regime – wherein . Herein, we observe an interesting phase-change in the sample-complexity of private learning:
Informal Theorem 3
In the information-scarce regime under -local-DP, the number of users required for non-adaptive cluster learning must satisfy: (Theorem 13).
Furthermore, for small , in particular, , we have: (Theorem 14).
To see why this result is surprising, consider the following toy problem: each item belongs to one of two clusters. Users arrive, sample a single item uniformly at random and learn its corresponding cluster, answer a query from the recommender system, and leave.
For non-private learning, if there is no constraint on the amount of information exchanged between the user and the algorithm, then the number of users needed for learning the clusters is (via a simple coupon-collector argument). Note that the amount of data each user has is (item index, cluster). Now if we put a constraint that the average amount of information exchanged between a user and the algorithm is 1 bit, then intuition suggests that the recommender system now needs users. This is achieved by the following simple strategy: each user reveals her complete information with probability , else reveals no information – clearly the amount of information exchanged per user is 1 bit on average, and a modified coupon-collector argument shows that this scheme requires users to learn the item clusters.
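The coupon-collector intuition above can be checked with a small simulation. This is illustrative only (it is not part of the paper's analysis, and the item count and reveal probability below are arbitrary choices): with no information constraint, roughly n ln n users suffice, and revealing with probability p inflates the requirement by a factor of about 1/p.

```python
import math
import random

def users_to_collect(n, reveal_prob, rng):
    """Simulate the toy problem: users arrive one by one, each samples a
    single item uniformly at random and learns its cluster.  A user then
    reveals her (item index, cluster) pair with probability reveal_prob,
    and reveals nothing otherwise.  Returns how many users arrive before
    every item's cluster has been revealed at least once."""
    seen, users = set(), 0
    while len(seen) < n:
        users += 1
        item = rng.randrange(n)
        if rng.random() < reveal_prob:
            seen.add(item)
    return users

n, trials = 200, 20
rng = random.Random(1)
# Unconstrained learning: the classic coupon-collector count, about n*ln(n).
full = sum(users_to_collect(n, 1.0, rng) for _ in range(trials)) / trials
# Revealing with probability 1/10 inflates the requirement by roughly 10x.
partial = sum(users_to_collect(n, 0.1, rng) for _ in range(trials)) / trials
```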
However, the situation changes if we impose a condition that the amount of information exchanged is exactly 1 bit per user (for example, the algorithm asks a yes/no question to the user); as a side-product of the techniques we develop for Theorem 14, we show that the number of users required in this case is (c.f. Theorem 10). This fundamental change in sample-complexity scaling is due to the combination of users having limited information and a ‘per-user information’ constraint (as opposed to the average information constraint). One major takeaway of our work is that local differential privacy in the information-scarce regime has a similar effect.
Finally for the information-scarce regime, we develop a new algorithm, MaxSense, which (under appropriate separation conditions) matches the above bound up to logarithmic factors:
Informal Theorem 4
Techniques: Our main technical contribution lies in the tools we use for the lower bounds in the information-scarce setting. By viewing the privacy mechanism as a noisy channel with appropriate constraints, we are able to use information theoretic methods to obtain bounds on private learning. Although connections between privacy and mutual information have been considered before (see McGregor et al. (2010); Alvim et al. (2011)), existing techniques do not capture the change in sample-complexity in high-dimensional regimes. We formalize a new notion of ‘channel mismatch’ between the ‘sampling channel’ (the partial ratings known to the users) and the privatization channel. In Section 4 we provide a structural lemma (Lemma 9) that quantifies this mismatch under general conditions, and demonstrate its use by obtaining tight lower bounds under 1-bit (non-private) sketches. In Section 4.3 we use it to obtain tight lower bounds under local-DP. In Section 6 we discuss its application to adaptive local-DP algorithms, establishing a lower bound of order – note that this again is a refinement on the bound in Theorem 7. Though we focus on the item clustering problem, our lower bounds apply to learning any finite hypothesis class under privacy constraints.
The information theoretic results also suggest that 1-bit privatized sketches are sufficient for learning in such scenarios. Based on this intuition, we show how existing spectral-clustering techniques can be extended to private learning in some regimes. More significantly, in the information-scarce regime, where spectral learning fails, we develop a novel algorithm based on blind probing of a large set of items. This algorithm, in addition to being private and having optimal sample-complexity in many regimes, suggests several interesting open questions, which we discuss in Section 7.
1.2 Related Work
Privacy preserving recommender systems: The design of recommender systems with differential privacy was studied by McSherry and Mironov (2009) under the centralized model. Like us, they separate the recommender system into two components, a learning phase (based on a database appropriately perturbed to ensure privacy) and a recommendation phase (performed by the users ‘at home’, without interacting with the system). They numerically compare the performance of the algorithm against non-private algorithms. In contrast, we consider a stronger notion of privacy (local-DP), and for our generative model, are able to provide tight analytical guarantees and further, quantify the impact of limited information on privacy.
Private PAC Learning and Query Release: Several works have considered private algorithms for PAC-learning. Blum et al. (2008); Gupta et al. (2011) consider the private query release problem (i.e., releasing approximate values for all queries in a given class) in the centralized model. Kasiviswanathan et al. (2008) show equivalences between: a) centralized private learning and agnostic PAC learning, b) local-DP and the statistical query (SQ) model of learning; this line of work is further extended by Beimel et al. (2010). Although some of our results (in particular, Theorem 7) are similar in spirit to lower bounds for PAC learning (see Kasiviswanathan et al. (2008); Beimel et al. (2010)), there are significant differences both in scope and technique. Furthermore:
We emphasize the importance of limited information, and characterize its impact on learning with local-DP. Hitherto unconsidered, information scarcity is prevalent in practical scenarios, and as our results show, it has strong implications for learning performance under local-DP.
Privacy in Statistical Learning: A large body of recent work has looked at the impact of differential privacy on statistical learning techniques. A majority of this work focuses on centralized differential privacy. For example, Chaudhuri et al. (2011) consider privacy in the context of empirical risk minimization; they analyze the release of classifiers, obtained via algorithms such as SVMs, with (centralized) privacy constraints on the training data. Dwork and Lei (2009) study algorithms for privacy-preserving regression under the centralized model; these however require running time which is exponential in the data dimension. Smith (2011) obtains private, asymptotically-optimal algorithms for statistical estimation, again though, in the centralized model.
More recently, Duchi et al. (2013) consider the problem of finding minimax rates for statistical estimators under local-DP. Their techniques are based on refined analysis of information theoretic quantities, including generalizations of the Fano’s inequality bounds we use in Section 3.1. However, the estimation problems they consider have a simpler structure – in particular, they involve learning from samples generated directly from an underlying model (albeit privatized). What makes our setting challenging is the combination of a generative model (the bipartite stochastic blockmodel) with incomplete information (due to user-item sampling) – it seems unlikely that the techniques of Duchi et al. (2013) can extend easily to our setting. Moreover, such lower bound techniques do not naturally yield good algorithms.
Other Notions of Privacy: The local-DP model which we consider has been studied before in privacy literature (Kasiviswanathan et al. (2008); Dwork et al. (2006)) and statistics (Warner (1965)). It is a stronger notion than central differential privacy, and also stronger than two other related notions: pan-privacy (Dwork et al. (2010b)) where the database has to also deal with occasional release of its state, and privacy under continual observations (Dwork et al. (2010a)), where the database must deal with additions and deletions, while maintaining privacy.
Recommendation algorithms based on incoherence: Apart from privacy-preserving algorithms, there is a large body of work on designing recommender systems under various constraints (usually low-rank) on the ratings matrix (for example, Wainwright (2009); Keshavan et al. (2010)). These methods, though robust, fail in the presence of privacy constraints, as the noise added as a result of privatization is much more than their noise-tolerance. This is intuitive, as successful matrix completion would constitute a breach of privacy; our work builds the case for using simpler lower dimensional representations of the data, and simpler algorithms based on extracting limited information (in our case, 1-bit sketches) from each user.
We now present our system model, formally define different notions of differential privacy, and introduce some tools from information theory that form the basis of our proofs.
2.1 The Bipartite Stochastic Blockmodel
Recommender systems typically assume the existence of an underlying low-dimensional generative model for the data – the aim then is to learn the parameters of this model, and then use the learned model to infer unknown user-item ratings. In this paper we consider a model wherein items and users belong to underlying clusters, and a user’s rating for an item depends only on the clusters they belong to. This is essentially a bipartite version of the Stochastic Blockmodel (Holland et al. (1983)), widely used in the model selection literature. The aim of the recommendation algorithm is to learn these clusters, and then reveal them to the users, who can then compute their own recommendations privately. Our model, though simpler than the state of the art in recommender systems, is still rich enough to account for many of the features seen empirically in recommender systems. In addition, it yields reasonable accuracy in non-private settings on meaningful datasets (c.f. Tomozei and Massoulié (2011)).
Formally, let be the set of users and the set of items. The set of users is divided into clusters , where cluster contains users. Similarly, the set of items is divided into clusters , where cluster contains items. We use to denote the (incomplete) matrix of user/item ratings, where each row corresponds to a user, and each column an item. For simplicity, we assume ; for example, this could correspond to ‘like/dislike’ ratings. Finally we have the following statistical assumption for the ratings – for user with user class , and item with item class , the rating is given by a Bernoulli random variable . Ratings for different user-item pairs are assumed independent.
In order to model limited information, i.e., the fact that users rate only a fraction of all items, we define a parameter to be the number of items a user has rated. More generally, we only need to know in an orderwise sense – for example, for some function . We assume that the rated items are picked uniformly at random. We define to be the information-rich regime, and to be the information-scarce regime.
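As a concrete illustration of the generative model just described, the following sketch samples one user's partial ratings. All names and parameter values here are hypothetical choices for illustration, not the paper's notation:

```python
import random

def sample_user_ratings(item_cluster, p, user_cluster, n_rated, rng):
    """Draw one user's partial ratings under the bipartite stochastic
    blockmodel: the user rates `n_rated` items chosen uniformly at random,
    and the rating of item j is a Bernoulli random variable whose parameter
    depends only on (the user's cluster, the cluster of item j).

    item_cluster : list mapping item index -> item-cluster id
    p            : p[user_cluster][item_cluster] = P(rating = 1)
    Returns a dict {item index: rating in {0, 1}}."""
    n = len(item_cluster)
    rated = rng.sample(range(n), n_rated)   # rated items picked uniformly
    return {j: int(rng.random() < p[user_cluster][item_cluster[j]])
            for j in rated}

rng = random.Random(0)
# Hypothetical instance: 6 items in 2 clusters, a single user cluster
# that tends to like cluster-0 items and dislike cluster-1 items.
item_cluster = [0, 0, 0, 1, 1, 1]
p = [[0.9, 0.1]]
ratings = sample_user_ratings(item_cluster, p, 0, 4, rng)
```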
Given this model, the aim of the recommender system is to learn the item-clusters from user-item ratings. Note that the difficulty in doing so is twofold:
The user-item ratings matrix is incomplete – in particular, each user has ratings for only out of items.
Users share their information only via a privacy-preserving mechanism (as we discuss in the next section).
Our work exposes how these two factors interact to affect the sample-complexity, i.e., the minimum number of users required to learn the item-clusters. We note also that another difficulty in learning is that the user-item ratings are noisy – however, as long as this noise does not depend on the number of items, this does not affect the sample-complexity scaling.
2.2 Differential Privacy
Differential privacy is a framework that defines conditions under which an algorithm can be said to be privacy preserving with respect to the input. Formally (following Dwork (2006)):
($\epsilon$-Differential Privacy) A randomized function $M$ that maps data $X$ to output $M(X)$ is said to be $\epsilon$-differentially private if, for all values $S$ in the range space of $M$, and for all ‘neighboring’ data $X, X'$, we have: $\Pr[M(X) = S] \le e^{\epsilon}\, \Pr[M(X') = S]$.
We assume that, conditioned on $X$, the output $M(X)$ is independent of any external side information (in other words, the output of the mechanism depends only on $X$ and its internal randomness). The definition of ‘neighboring’ is chosen according to the situation, and determines which data remain private. In the original definition Dwork (2006), two databases are said to be neighbors if the larger database is constructed by adding a single tuple to the smaller database. In the context of ratings matrices, two matrices can be neighbors if they differ in: a single row (per-user privacy), or a single rating (per-rating privacy).
Two crucial properties of differential privacy are composition and post-processing. We state these here without proof; c.f. Dwork (2006) for details. Composition captures the reduction in privacy due to sequentially applying multiple differentially-private release mechanisms:
(Composition) If outputs $Y_1, \ldots, Y_k$ are obtained from data $X$ by $k$ different randomized functions $M_1, \ldots, M_k$, where $M_i$ is $\epsilon_i$-differentially private, then the resultant function $X \mapsto (Y_1, \ldots, Y_k)$ is $\left(\sum_{i=1}^{k} \epsilon_i\right)$-differentially private.
Post-processing states that processing the output of a differentially private release mechanism can only make it more differentially private (i.e., with a smaller ) vis-a-vis the input:
(Post-processing) If a function $M$ is $\epsilon$-differentially private, then any composition $f \circ M$ is $\epsilon'$-differentially private for some $\epsilon' \le \epsilon$.
In settings where the database curator is untrusted, an appropriate notion of privacy is local differential privacy (or local-DP). For each user , let be its private data – in the recommendation context, the rated-item labels and corresponding ratings – and let be the data that the user makes publicly available to the untrusted curator. Local-DP requires that is differentially private w.r.t. . This paradigm is similar to the Randomized Response technique in statistics Warner (1965). It is the natural notion of privacy in the case of untrusted databases, as the data is privatized at the user-end before storage in the database; to emphasize this, we alternately refer to it as User-end Differential Privacy.
We conclude this section with a mechanism for releasing a single bit under -differential privacy. Differential privacy for this mechanism is easy to verify using equation 1.
($\epsilon$-DP bit release): Given bit $X \in \{0, 1\}$, set output $Y$ to be equal to $X$ with probability $e^{\epsilon}/(1 + e^{\epsilon})$, else equal to $1 - X$. Then $Y$ is $\epsilon$-differentially private w.r.t. $X$.
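A minimal implementation of this bit-release mechanism, with an empirical check that the likelihood ratio between the two inputs stays within $e^{\epsilon}$, might look as follows (an illustrative sketch; the variable names are ours):

```python
import math
import random

def dp_bit_release(x, eps, rng):
    """Randomized response: keep the bit with probability
    e^eps / (1 + e^eps), flip it otherwise.  For either output value y,
    P[Y=y | X=0] / P[Y=y | X=1] lies in [e^-eps, e^eps], which is exactly
    eps-differential privacy for a single bit."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return x if rng.random() < keep else 1 - x

rng = random.Random(0)
eps = 1.0
trials = 200_000
# Empirical output distributions under the two possible inputs.
y_given_0 = sum(dp_bit_release(0, eps, rng) for _ in range(trials)) / trials
y_given_1 = sum(dp_bit_release(1, eps, rng) for _ in range(trials)) / trials
# P[Y=0 | X=0] / P[Y=0 | X=1] should be close to e^eps.
ratio = (1 - y_given_0) / (1 - y_given_1)
```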
2.3 Preliminaries from Information Theory
For a random variable $X$ taking values in some discrete space $\mathcal{X}$, its entropy is defined as $H(X) = -\sum_{x \in \mathcal{X}} \Pr[X = x] \log \Pr[X = x]$. (For notational convenience, we use $\log$ for the logarithm to the base $2$ throughout; hence, the entropy is in ‘bits’.) For two random variables $X, Y$, the mutual information between them is given by: $I(X; Y) = H(X) - H(X \mid Y)$.
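These standard definitions translate directly into code. The following illustrative helpers (ours, not the paper's) compute entropy and mutual information for finite distributions:

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a distribution given as a list of
    probabilities (zero entries contribute nothing)."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ),
    with `joint` a 2-D list of joint probabilities p(x,y)."""
    px = [sum(row) for row in joint]                 # marginal of X
    py = [sum(col) for col in zip(*joint)]           # marginal of Y
    return sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(joint))
               for j in range(len(joint[0]))
               if joint[i][j] > 0)

h = entropy([0.5, 0.5])                                      # 1 bit
mi_indep = mutual_information([[0.25, 0.25], [0.25, 0.25]])  # independent: 0
mi_copy = mutual_information([[0.5, 0.0], [0.0, 0.5]])       # Y = X: 1 bit
```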
Our main tools for constructing lower bounds are variants of Fano’s Inequality, which are commonly used in non-parametric statistics literature (c.f. Santhanam and Wainwright (2009); Wainwright (2009)). Consider a finite hypothesis class , indexed by . Suppose that we choose a hypothesis uniformly at random from , sample a data set of samples drawn in an i.i.d. manner according to a distribution (in our case, corresponds to a user, and the ratings drawn according to the statistical model in Section 2.1), and then provide a private version of this data to the learning algorithm. We can represent this as the Markov chain:
Further, we define a given learning algorithm to be unreliable for the hypothesis class if for a hypothesis drawn uniformly at random, we have .
Fano’s inequality provides a lower bound on the probability of error under any learning algorithm in terms of the mutual information between the underlying hypotheses and the samples. A basic version of the inequality is as follows:
(Fano’s Inequality) Given a hypothesis drawn uniformly from , and samples drawn according to , for any learning algorithm, the average probability of error satisfies:
As a direct consequence of this result, if the samples are such that , then any algorithm fails to correctly identify almost all of the possible underlying models. Though this is a weak bound, equation 2 turns out to be sufficient to study sample-complexity scaling in the cases we consider. In Appendix A, we consider stronger versions of the above lemma, as well as more general criteria for approximate model selection (e.g., allowing for distortion).
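As a numeric illustration, one common form of Fano's inequality lower-bounds the error probability by $1 - (I + 1)/\log_2|\mathcal{H}|$ for a hypothesis drawn uniformly from $\mathcal{H}$. The exact constants vary across statements of the inequality, so treat the formula below as an assumption for illustration; the point is only how little information forces unreliability when the hypothesis class is large:

```python
import math

def fano_error_lower_bound(mutual_info_bits, hypothesis_count):
    """Standard-form Fano bound for a uniformly drawn hypothesis:
    any estimator errs with probability at least
    1 - (I + 1) / log2(hypothesis_count)."""
    return max(0.0,
               1.0 - (mutual_info_bits + 1.0) / math.log2(hypothesis_count))

# With 2^100 possible clusterings but only 10 bits of (privatized)
# information reaching the learner, every algorithm is unreliable.
bound = fano_error_lower_bound(10.0, 2 ** 100)
```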
3 Item-Clustering under Local-DP: The Information-Rich Regime
In this section, we derive a basic lower bound on the number of users needed for accurate learning under local differential privacy. This relies on a simple bound on the mutual information between any database and its privatized output, and hence is applicable in general settings. Returning to item-clustering, we give an algorithm that matches the optimal scaling (up to logarithmic factors) under one of the following two conditions: , i.e., each user has rated a constant fraction of items (the information-rich regime), or only the ratings are private, not the identity of the rated items.
3.1 Differential Privacy and Mutual Information
We first present a lemma that characterizes the mutual information leakage across any differentially private channel:
Given (private) r.v. , a privatized output obtained by any locally DP mechanism , and any side information , we have:
Lemma 6 follows directly from the definitions of mutual information and differential privacy (note that for any such mechanism, the output given the input is conditionally independent of any side-information). We note that similar results have appeared before in the literature; for example, equivalent statements appear in McGregor et al. (2010); Alvim et al. (2011). We present the proof here for the sake of completeness:
Proof [Proof of Lemma 6]
Here inequality is a direct application of the definition of differential privacy (Equation 1), and in particular, the fact that it holds for any side information.
Returning to the private learning of item classes, we obtain a lower bound on the sample-complexity by considering the following special case of the item-clustering problem: consider , and let be a mapping of the item set to two classes represented as – hence the size of the hypothesis class is . Each user has some private data , which is generated via the bipartite Stochastic Blockmodel (c.f., Section 2.1). Recall we define a learning algorithm to be unreliable for if . Using Lemma 6 and Fano’s inequality (Lemma 5), we get the following lower bound on the sample-complexity:
Suppose the underlying clustering is drawn uniformly at random from . Then any learning algorithm obeying -local-DP is unreliable if the number of queries satisfies: .
Proof We now have the following information-flow model for each user (under local-DP):
Fano’s inequality (Lemma 5) then implies that a learning algorithm is unreliable if the number of queries satisfies: .
We note here that the above theorem, though stated for the bipartite Stochastic Blockmodel, in fact gives sample-complexity bounds for more general model-selection problems. Further, in Appendix A, we extend the result to allow for distortion – wherein the algorithm is allowed to make a mistake on some fraction of item-labels.
For the bipartite Stochastic Blockmodel, though the above bound is not the tightest, it turns out to be achievable (up to log factors) in the information-rich regime, as we show next. We note that a similar bound was given by Beimel et al. (2010) for PAC-learning under centralized DP, using more explicit counting techniques. Both our results and the bounds in Beimel et al. (2010) fail to exhibit the correct scaling in the information-scarce setting. However, unlike proofs based on counting arguments, our method allows us to leverage more sophisticated information theoretic tools for other variants of the problem, like those we consider subsequently in Section 4.
3.2 Item-Clustering in the Information-Rich Regime
To conclude this section, we outline an algorithm for clustering in the information-rich regime. The algorithm proceeds as follows: (i) the recommendation algorithm provides each user with two items picked at random, whereupon the user computes a private sketch which equals $1$ if she rated both items positively, and $0$ otherwise; (ii) users release a privatized version of their private sketch using the $\epsilon$-DP bit release mechanism; (iii) the algorithm constructs a matrix in which each entry is obtained by adding the privatized sketches from all users queried with the corresponding item-pair; and (iv) it finally performs spectral clustering of items based on this matrix. This algorithm, which we refer to as the Pairwise-Preference algorithm, is formally specified in Figure 1.
Setting: Items , Users . Each user has set of ratings . Each item associated with a cluster .
Return: Cluster labels
Stage 1 (User sketch generation):
For each user , pick items :
At random if
If is known, pick two random rated items.
User generates a private sketch given by:
where if , and otherwise.
Stage 2 (User sketch privatization):
Each user releases privatized sketch from using the -DP bit release mechanism (Proposition 4).
Stage 3 (Spectral Clustering):
Generate a pairwise-preference matrix , where:
Extract the top normalized eigenvectors (corresponding to largest eigenvalues of ).
Project each row of into the -dimensional profile space of the top eigenvectors.
Perform k-means clustering in the profile space to get the item clusters
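The pipeline above can be sketched end-to-end in pure Python. Two deliberate simplifications relative to the algorithm as specified, so the example stays self-contained: we substitute an agreement sketch (do the user's ratings of the two items agree?) for the positive-pair sketch, so that the sign of the top eigenvector alone separates two item-clusters, and we replace the full spectral-clustering step by power iteration plus a sign cut. All parameters are hypothetical; this is an illustration, not the paper's algorithm.

```python
import math
import random

def dp_flip(bit_pm1, eps, rng):
    # eps-DP release of a +/-1 bit via randomized response.
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return bit_pm1 if rng.random() < keep else -bit_pm1

def pairwise_preference_cluster(item_cluster, like_prob, n_users, eps, rng):
    """Toy pairwise-preference pipeline: each user is queried on a random
    item pair, releases a privatized agreement sketch, and the algorithm
    clusters items by the sign of the top eigenvector of the aggregated
    sketch matrix (single user class, two item clusters)."""
    n = len(item_cluster)
    B = [[0.0] * n for _ in range(n)]
    for _ in range(n_users):
        i, j = rng.sample(range(n), 2)
        # The user's private like/dislike of the two items.
        ri = rng.random() < like_prob[item_cluster[i]]
        rj = rng.random() < like_prob[item_cluster[j]]
        s = 1 if ri == rj else -1          # agreement sketch (illustrative)
        y = dp_flip(s, eps, rng)           # privatized before release
        B[i][j] += y
        B[j][i] += y
    # Power iteration for the top eigenvector of the symmetric matrix B.
    v = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(200):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    # Cut by sign; cluster ids are arbitrary up to a global flip.
    return [0 if x >= 0 else 1 for x in v]

rng = random.Random(0)
item_cluster = [0] * 10 + [1] * 10
labels = pairwise_preference_cluster(item_cluster, [0.9, 0.1],
                                     n_users=60_000, eps=1.0, rng=rng)
```

Note that the recovered labels are only defined up to relabeling: the algorithm identifies the partition, not which cluster is which.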
Recall that in the bipartite Stochastic Blockmodel, we assume that the users belong to clusters, each of size . We now have the following theorem that characterizes the performance of the Pairwise-Preference algorithm.
The Pairwise-Preference algorithm satisfies -local-DP. Further, suppose the eigenvalues and eigenvectors of satisfy the following non-degeneracy conditions:
The largest magnitude eigenvalues of have distinct absolute values.
The corresponding eigenvectors , normalized under the -norm, , for some satisfy:
Then, in the information-rich regime (i.e., when ), there exists such that the item clustering is successful with high probability if the number of users satisfies:
Proof [Proof Outline]
Local differential privacy under the Pairwise-Preference algorithm is guaranteed by the use of $\epsilon$-DP bit release, and the composition property. The performance analysis is based on a result on spectral clustering by Tomozei and Massoulié (2011). The main idea is to interpret as representing the edges of a random graph over the item set, with an edge between an item in class and another in class if . In particular, from the definition of the Pairwise-Preference algorithm, we can compute that the probability of such an edge is . This puts us in the setting analyzed by Tomozei and Massoulié (2011) – we can now use their spectral clustering bounds to get the result. For the complete proof, see Appendix B.
4 Local-DP in the Information-Scarce Regime: Lower Bounds
As in the previous lower bound, we consider a simplified version of the problem, where there is a single class of users, and each item is ranked either or deterministically by each user (i.e., for all items). Let be the underlying clustering function; in general we can think of this as an -bit vector . We assume that the user-data for user is given by , where is a size subset of representing items rated by user , and are the ratings for the corresponding items; in this case, . The set is assumed to be chosen uniformly at random from amongst all size- subsets of . We also denote the privatized sketch from user as . Here the space to which sketches belong is assumed to be an arbitrary finite or countably infinite space. The sketch is assumed -differentially private. Finally, as before, we assume that is chosen uniformly over . Thus we have the following information-flow model for the user :
To obtain tighter lower bounds on the number of users needed for accurate item clustering, we need sharper bounds on the mutual information between the underlying item-clustering model and the data available to the algorithm. The main idea behind our lower bound techniques is to view the above chain as a combination of two channels – the first wherein the user-data is generated (sampled) by the underlying statistical model, and the second wherein the algorithm receives a sketch of the user's data. We then develop a new information inequality that allows us to bound the mutual information in terms of the mismatch between the channels. This technique turns out to be useful in settings without privacy as well – in Section 4.2, we show how it can be used to get sample-complexity bounds for learning with 1-bit sketches.
4.1 Mutual Information under Channel Mismatch
We now establish a bound for the mutual information between a statistical model and a low-dimensional sketch, which is the main tool we use to obtain sample-complexity lower bounds. We define to be the collection of all size- subsets of , and to be the set from which user information (i.e., ) is drawn, and define . Finally, indicates that the expectation is over the random variable .
Given the Markov Chain , let be two pairs of ‘user-data’ sets which are independent and identically distributed according to the conditional distribution of the pair given . Then, the mutual information satisfies:
where we use the notation to denote that the two user-data sets are consistent on the index set on which they overlap, i.e.,
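The consistency relation above can be made concrete with a small helper (the function name and data layout are ours, for illustration only): two user-data sets are consistent when their ratings agree on every item they both rated.

```python
def consistent(S1, x1, S2, x2):
    """Return True if the two user-data sets agree on every shared item.

    S1, S2 are lists of rated item indices; x1, x2 hold the corresponding
    ratings. (Names and layout are illustrative, not the paper's notation.)
    """
    overlap = set(S1) & set(S2)
    return all(x1[S1.index(i)] == x2[S2.index(i)] for i in overlap)
```

For example, `consistent([0, 2], [1, 0], [2, 5], [0, 1])` holds because the only shared item, item 2, is rated 0 by both users; two user-data sets with no shared items are vacuously consistent.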
Proof For brevity, we use the shorthand notation and finally . Now we have:
Let . As above, we use the shorthand notation , where are random variables and their corresponding realizations. Now we have:
(Summing over )
(By the Markov property)
where the last equality is obtained using the fact that the type of each item is independent and uniformly distributed over . Next, using a similar set of steps, we have:
We note here that the above lemma is a special case (where takes the uniform measure over ) of a more general lemma, which we state and prove in Appendix C.
4.2 Sample-Complexity for Learning with 1-bit Sketches
To demonstrate the use of Lemma 9, we first consider a related problem that highlights the effect of per-user constraints (as opposed to average constraints) on the mutual information. We consider the same item-class learning problem as before with (i.e., each user has access to a single rating), but instead of a privacy constraint, we consider a ‘per-user bandwidth’ constraint, wherein each user can communicate only a single bit to the learning algorithm.
Suppose , with drawn i.i.d. uniformly over . Then for any 1-bit sketch derived from , it holds that: and consequently, there exists a constant such that any cluster learning algorithm using queries with 1-bit responses is unreliable if the number of users satisfies
Proof In order to use Lemma 9, we first note that is a convex function of for fixed (Theorem in Cover and Thomas (2006)). Writing as , we observe that the extremal points of the kernel correspond to , where the mutual information is maximized. This implies that the class of deterministic queries with 1-bit responses that maximizes mutual information has the following structure: given user-data , the algorithm provides user with an arbitrary set of (items, ratings), and the user identifies if is contained in . Formally, the query is denoted (i.e., is ).
Defining , for a query response , we have the following:
and similarly where is the complement of set . From Lemma 9, for r.v.s , we have:
Introducing the notation , the following identity is easily established:
The RHS of (6) is a non-negative definite quadratic form of the variables (since ). Thus:
where if and if . Now for a given , consider the partitioning of the set into , where for , . We then have the following:
Finally, using Fano's inequality (Lemma 5), we get the result.
Note that the above bound is tight – to see this, consider an (adaptive) scheme where each user is asked a random query of the form “Is ?” (where and ). The average time between two successful queries is , and one needs successful queries to learn all the bits. This demonstrates an interesting change in the sample-complexity of learning with per-user communication constraints (1-bit sketches in this section, privacy in the next section) versus average-user constraints (mutual information bound or average bandwidth).
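A toy simulation of this adaptive scheme (with hypothetical small parameters): each user holds a single random rating, the algorithm targets a still-unknown bit with a guessed rating, and the user answers yes only on an exact match.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                                   # number of items (toy size)
x = rng.integers(0, 2, size=n)          # hidden per-item labels, as bits

learned = {}                            # item index -> recovered bit
users = 0
while len(learned) < n and users < 50_000:
    users += 1
    i_u = int(rng.integers(n))          # the single item this user has rated
    j = next(i for i in range(n) if i not in learned)  # target an unknown bit
    s = int(rng.integers(0, 2))         # guessed rating in the query "Is x_j = s?"
    if i_u == j and x[i_u] == s:        # yes only if item and rating both match
        learned[j] = s

recovered = np.array([learned[j] for j in range(n)])
```

Each query succeeds only when the queried item coincides with the user's rated item and the guess is right, so successful queries are rare and many users are consumed per recovered bit, in line with the tightness discussion above.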
4.3 Sample-Complexity for Learning under Local-DP
We now exploit the above techniques to obtain lower bounds on the scaling required for accurate clustering with DP in an information-scarce regime, i.e., when . To do so, we first require a technical lemma that establishes a relation between the distribution of a random variable with and without conditioning on a differentially private sketch:
Given a discrete random variable and some -differentially private ‘sketch’ variable generated from , there exists a function such that for any and :
Thus, we can define:
Further, from the definition of -DP, we have:
and hence we have .
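A standard randomized-response bit release illustrates the lemma: the likelihood ratio between the two conditional output distributions is exactly the exponential of the privacy parameter, so conditioning on the sketch shifts probabilities by at most that factor. (This mechanism is a common example of a differentially private release; the paper's mechanism may differ.)

```python
import math
import numpy as np

def dp_bit_release(bit, eps, rng):
    """Randomized response: keep the true bit w.p. e^eps / (1 + e^eps)."""
    keep = rng.random() < math.exp(eps) / (1.0 + math.exp(eps))
    return bit if keep else 1 - bit

eps = 1.0                                  # illustrative privacy level
p_keep = math.exp(eps) / (1.0 + math.exp(eps))

# P(Y = y | X = 0) / P(Y = y | X = 1) equals e^eps or e^-eps for y in {0, 1},
# matching the bounded distortion used in the lemma.
ratio = p_keep / (1.0 - p_keep)            # = e^eps
```

Empirically, the release reports the true bit with frequency close to `p_keep`, and no output value is ever more than a factor `ratio` more likely under one input than the other.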
Recall that we define to be the set from which user information is drawn. We write for the base probability distribution on and (note: the two are i.i.d. uniform) over , and denote by the mathematical expectation under . We also need the following estimate (cf. Appendix C for the proof):
If , then:
In the information-scarce regime, i.e., when , under -local-DP we have:
and consequently, there exists a constant such that any cluster learning algorithm with -local-DP is unreliable if the number of users satisfies
Proof To bound the mutual information between the underlying model and each private sketch, we use Lemma 9. In particular, we show that the mutual information is bounded by for any given value of the private sketch.
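To see the flavor of such a bound on a single bit, one can compute the mutual information between a uniform bit and its randomized-response release in closed form; since the conditional output probabilities are within an exponential factor of the marginal under the privacy constraint, the mutual information is at most the privacy parameter in nats. The sketch below just checks this numerically under illustrative values (the mechanism and parameter are ours, not the paper's).

```python
import math

eps = 0.5                                   # illustrative privacy level
p = math.exp(eps) / (1.0 + math.exp(eps))   # randomized-response keep probability

def H(q):
    """Binary entropy in nats."""
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)

# X uniform on {0, 1}, Y its randomized-response release. By symmetry Y is
# also uniform, so I(X; Y) = H(Y) - H(Y | X) = H(1/2) - H(p).
mi = H(0.5) - H(p)
```

At `eps = 0.5` this gives a mutual information of roughly 0.03 nats, far below the trivial one-bit ceiling, which is the quantitative effect the lower bound exploits.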