Passive and active attackers in noiseless privacy
Differential privacy offers clear and strong quantitative guarantees for privacy mechanisms, but it assumes an attacker that knows all but one record of the dataset. In many applications, this assumption overapproximates the attacker’s actual strength. This can lead to over-cautiously chosen noise parameters, and in turn to unnecessarily poor utility.
Recent work has made significant steps towards privacy in the presence of limited background knowledge. Definitions like noiseless privacy model a weaker attacker who has uncertainty about the data. In this technical report, we show that existing definitions implicitly assume an active attacker, who can not only see, but control some of the data. We argue that in some cases, it makes sense to consider a passive attacker, who has some knowledge of the data but cannot influence it. We propose an alternative definition to capture this attacker model, we study the link between these two alternatives, and we show some consequences of this distinction on the security of thresholding schemes.
Differential privacy is a strong quantitative privacy notion, particularly useful in scenarios with competing utility and privacy requirements. A typical scenario is querying a sensitive database: the query-issuer, who may be adversarial, is allowed to send statistical queries and expects accurate responses (utility). However, the adversary should learn as little as possible about individual entries in the database (privacy). To account for potential side information about the sensitive database or capability to influence the data, differential privacy quantifies the maximum difference in query results between two databases that differ in only one element. As such, the adversary is assumed to be very strong: she implicitly knows all data in the database, except one single record (her target).
This assumption of almost-complete background knowledge is, in many contexts, unrealistically strong and vastly overapproximates most adversaries’ actual capabilities. In most practical scenarios, an attacker who already has perfect knowledge about almost all elements of a sensitive dataset effectively has access to the dataset itself. In such a case, the sensible goal is to protect the dataset as a whole to prevent this from happening, not to protect individual records. It is natural to take into account that a strong attacker might have some level of background knowledge, but it is also reasonable to assume that this knowledge is limited.
Weakening differential privacy by assuming that the adversary has limited background knowledge can significantly increase the utility of statistical mechanisms, while still providing guarantees about their privacy. This idea is not new, and the literature contains some initial results about differential privacy in the presence of limited background knowledge, showing that even noiseless mechanisms can be shown to satisfy weakened privacy requirements.
The main definition used when limiting the attacker’s background knowledge is noiseless privacy, an idea that was first proposed in [duan2009privacy], and formalized in [bhaskar2011noiseless]. This definition and its variants, like pufferfish privacy [kifer2012rigorous], make it possible to model an attacker with partial knowledge. Such an attacker might have some auxiliary information about the data, or know some of the records, but she still has some degree of uncertainty about the data.
In this work, we show that there are two possible ways to model this partial knowledge, depending on whether the attacker is active (able to influence the data) or passive (unable to influence the data). We show that the former case is implicitly used in noiseless privacy, and define a new variant to capture the latter case. We show basic properties of these two variants, and we show that in certain conditions, the two notions coincide. Further, we show using the example of thresholding that these two notions can be extremely different, which shows the importance of distinguishing them.
2 Related work
Differential privacy was initially proposed by Dwork et al. [dwork2006calibrating], and is based on the indistinguishability between probability distributions.
Many variants have been proposed since then, and many of these model an adversary with limited background knowledge. Among those, two use this concept of indistinguishability to model privacy: noiseless privacy [duan2009privacy, bhaskar2011noiseless], and distributional differential privacy [bassily2013coupled]. In this work, we focus on the former.
Other variants, which also model adversaries with limited background knowledge, are not based on indistinguishability. Instead, they directly constrain the posterior knowledge of an attacker as a function of their prior knowledge. Among those are adversarial privacy [rastogi2009relationship], membership privacy [li2013membership], and a posteriori noiseless privacy [bhaskar2011noiseless]. Some relations between these notions and noiseless privacy are already known [bhaskar2011noiseless, li2013membership], so we do not study them in detail.
Several other definitions have been proposed; in particular, pufferfish privacy [kifer2012rigorous] can be seen as a generalization of noiseless privacy: instead of protecting individual tuples, it protects arbitrary sensitive properties of the data. It is straightforward to generalize the results in this paper to this more generic framework.
In all the following, $\mathcal{D}$ designates the set of possible databases. A dataset $D\in\mathcal{D}$ is a family of records: $D=(D_i)_{1\le i\le n}$, where each $D_i$ is in a fixed set $\mathcal{T}$ of possible records, and $n$ is the size of $D$. We only consider databases of fixed size $n$, and usually do not make explicit the range of database indices $i$. Mechanisms, typically denoted $\mathcal{M}$, take databases as input, and output some value in an output space $\mathcal{O}$.
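First, we recall the original definitions of $(\varepsilon,\delta)$-indistinguishability and $(\varepsilon,\delta)$-differential privacy.
Definition 1 ($(\varepsilon,\delta)$-indistinguishability [dwork2006calibrating]).
Two random variables $X$ and $Y$ are $(\varepsilon,\delta)$-indistinguishable if for all measurable sets $S$ of possible events:
$$\Pr[X\in S]\le e^{\varepsilon}\cdot\Pr[Y\in S]+\delta\quad\text{and}\quad\Pr[Y\in S]\le e^{\varepsilon}\cdot\Pr[X\in S]+\delta.$$
We denote this $X\approx_{\varepsilon,\delta}Y$. If $\delta=0$, we call this $\varepsilon$-indistinguishability, and denote it $X\approx_{\varepsilon}Y$.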
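Definition 2 ($(\varepsilon,\delta)$-differential privacy [dwork2006differential]).
A privacy mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$-differentially private if for any datasets $D_1$ and $D_2$ that differ only on the data of one individual, $\mathcal{M}(D_1)\approx_{\varepsilon,\delta}\mathcal{M}(D_2)$. If $\delta=0$, the mechanism is said to be $\varepsilon$-differentially private.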
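For mechanisms with finitely many outputs, the condition of Definition 1 can be checked directly. The following is a minimal sketch, not part of the original report: it relies on the standard fact that, for discrete distributions, the supremum of $\Pr[X\in S]-e^{\varepsilon}\Pr[Y\in S]$ over all sets $S$ is reached on the set of outcomes where the first probability exceeds $e^{\varepsilon}$ times the second. The function names and the dictionary representation of distributions are illustrative assumptions.

```python
import math

def hockey_stick(p, q, eps):
    """Largest value of Pr[X in S] - e^eps * Pr[Y in S] over all sets S.

    p and q map each outcome to its probability under X and Y respectively
    (finite support is an assumption of this sketch).
    """
    return sum(max(0.0, pr - math.exp(eps) * q.get(x, 0.0)) for x, pr in p.items())

def indistinguishable(p, q, eps, delta, tol=1e-12):
    """Check (eps, delta)-indistinguishability in both directions.

    tol absorbs floating-point rounding; it is not part of the definition.
    """
    return (hockey_stick(p, q, eps) <= delta + tol
            and hockey_stick(q, p, eps) <= delta + tol)

# Hypothetical example: output distributions of randomized response that
# reports the true bit with probability 0.75, on two neighboring inputs.
on_yes = {"yes": 0.75, "no": 0.25}
on_no = {"yes": 0.25, "no": 0.75}
print(indistinguishable(on_yes, on_no, math.log(3), 0.0))
print(indistinguishable(on_yes, on_no, 0.5, 0.0))
```

In this toy example the likelihood ratio of every output is at most 3, so the two distributions are $\ln 3$-indistinguishable, while a smaller $\varepsilon$ requires a nonzero $\delta$.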
Assuming that the attacker lacks background knowledge can be represented by considering the input data as noisy. Instead of comparing two databases that differ in only one record, it is natural to compare two probability distributions, conditioned on the value of one record. This idea was first proposed in [duan2009privacy], and formalized in [bhaskar2011noiseless] as noiseless privacy.
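Definition 3 ($(\Theta,\varepsilon,\delta)$-noiseless privacy [bhaskar2011noiseless]).
Given a family $\Theta$ of probability distributions on $\mathcal{D}$, a mechanism $\mathcal{M}$ is $(\Theta,\varepsilon,\delta)$-noiseless private if for all $\theta\in\Theta$, all $i$ and all $a,b\in\mathcal{T}$, $\mathcal{M}(D)_{|D_i=a}\approx_{\varepsilon,\delta}\mathcal{M}(D)_{|D_i=b}$, where $D\sim\theta$.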
Note that this definition does not make the background knowledge explicit. In the original definition, an additional parameter $\mathit{Aux}$ was defined to capture the attacker’s auxiliary knowledge, and all probabilities were conditioned on $\mathit{Aux}$. However, if $\mathit{Aux}$ is fixed, it is equivalent to consider a different distribution $\theta'$, corresponding to $\theta$ conditioned on $\mathit{Aux}$.
4 Partial knowledge
The original definition of noiseless privacy has only one possible value of the background knowledge, which makes it possible to remove it entirely and make it implicit in $\theta$. However, this representation is not very practical: it is typically impossible to know exactly what knowledge the attacker has. Instead, one might want to protect against a class of possible attackers, who might have some limited knowledge of the data. To capture this, we use a convention similar to [bassily2013coupled], and consider a family $\mathcal{B}$ of background knowledge functions: a possible attacker can be modeled by a function $B\in\mathcal{B}$. Given a random database $D\sim\theta$, the attacker receives $B(D)$ as background knowledge.
How can we model the knowledge gain of the attacker using this formalism? We first present the definition of the privacy loss random variable, a concept corresponding to this knowledge gain in classical differential privacy, and adapt it to a limited background knowledge context. Then, we explain how it can be used to derive two definitions of privacy under partial knowledge, depending on whether the attacker is active or passive.
4.1 Privacy loss random variable
The privacy loss random variable was first defined in [dinur2003revealing], and measures the knowledge gain of an attacker who observes the output of a privacy mechanism. For simplicity, we only consider the case where $\mathcal{O}$ is a countable set.
Definition 4 (Privacy Loss Random Variable).
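Given a mechanism $\mathcal{M}$, and two datasets $D_1$ and $D_2$, the privacy loss random variable (PLRV) of an output $O\in\mathcal{O}$ is defined as
$$\mathcal{L}_{\mathcal{M}(D_1)/\mathcal{M}(D_2)}(O)=\ln\left(\frac{\Pr[\mathcal{M}(D_1)=O]}{\Pr[\mathcal{M}(D_2)=O]}\right)$$
if neither $\Pr[\mathcal{M}(D_1)=O]$ nor $\Pr[\mathcal{M}(D_2)=O]$ is 0; in case only $\Pr[\mathcal{M}(D_2)=O]$ is zero then $\mathcal{L}_{\mathcal{M}(D_1)/\mathcal{M}(D_2)}(O)=\infty$, otherwise $-\infty$. When the mechanism is clear from context, we write just $\mathcal{L}_{D_1/D_2}$.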
Let us rephrase classical differential privacy using the PLRV. This reformulation can be found, e.g., as Lemma 1 in [MeMo_18:bucketing].
Lemma 1 ([MeMo_18:bucketing, Lemma 1]).
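A mechanism $\mathcal{M}$ is $\varepsilon$-DP if for all adjacent databases $D_1$, $D_2$:
$$\mathcal{L}_{D_1/D_2}(O)\le\varepsilon$$
for all $O$ such that $\Pr[\mathcal{M}(D_1)=O]>0$.
A mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$-DP if for all adjacent databases $D_1$, $D_2$, the value $\delta_{D_1/D_2}$ defined as:
$$\delta_{D_1/D_2}=\mathbb{E}_{O\sim\mathcal{M}(D_1)}\left[\max\left(0,1-e^{\varepsilon-\mathcal{L}_{D_1/D_2}(O)}\right)\right]$$
is at most $\delta$. This value is $\varepsilon$-tight in the sense that, for a given $\varepsilon$, the largest value of $\delta_{D_1/D_2}$ over all adjacent databases is the smallest $\delta$ such that $\mathcal{M}$ is $(\varepsilon,\delta)$-DP.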
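To make the PLRV formulation concrete, here is a minimal sketch for a mechanism with finitely many outputs. It assumes the usual form of the tight bound, namely the expectation of $\max(0,1-e^{\varepsilon-\mathcal{L}(O)})$ over outputs drawn from the mechanism run on the first database. The function names and the two example distributions are hypothetical.

```python
import math

def privacy_loss(p1, p2, output):
    """PLRV of `output`, where p1 and p2 are the output distributions
    of the mechanism on the two databases (dicts: output -> probability)."""
    a, b = p1.get(output, 0.0), p2.get(output, 0.0)
    if a > 0 and b > 0:
        return math.log(a / b)
    if a > 0:
        return math.inf   # only the second probability is zero
    return -math.inf      # the first probability is zero

def tight_delta(p1, p2, eps):
    """E_{O ~ p1}[max(0, 1 - exp(eps - L(O)))] for one ordered pair of databases."""
    total = 0.0
    for output, pr in p1.items():
        loss = privacy_loss(p1, p2, output)
        total += pr * max(0.0, 1.0 - math.exp(eps - loss))
    return total

# Hypothetical output distributions on two adjacent databases.
on_d1 = {0: 0.5, 1: 0.5}
on_d2 = {0: 0.9, 1: 0.1}
print(tight_delta(on_d1, on_d2, 1.0))
```

For this pair, only output 1 has a privacy loss above $\varepsilon=1$, and the result coincides with $0.5-e\cdot0.1$, the value obtained by summing $\max(0,\Pr[\mathcal{M}(D_1)=O]-e^{\varepsilon}\Pr[\mathcal{M}(D_2)=O])$ over outputs. A full check of $(\varepsilon,\delta)$-DP would repeat this for every ordered pair of adjacent databases.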
Let us now consider the case where the attacker only has partial knowledge about the data. How do we model this partial knowledge, and adapt the definition of privacy? First, let us adapt the definition of the privacy loss random variable to this new setting. Recall that we model the attacker’s partial knowledge by a function $B$ that takes a dataset $D$ as input and returns the corresponding background knowledge $B(D)$. We still assume that the attacker is trying to distinguish between $D_i=a$ and $D_i=b$ by observing output $O$. But this time, the database $D$ is generated from an unknown distribution $\theta$, and the attacker also has a certain background knowledge $\hat{B}$, a new parameter in the PLRV definition. We need to condition all probabilities on $B(D)=\hat{B}$.
Definition 5 (PLRV for partial knowledge).
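Given a mechanism $\mathcal{M}$, a distribution $\theta$, a function $B$, an index $i$, values $a,b\in\mathcal{T}$, the PLRV of an output $O\in\mathcal{O}$ and a possible value $\hat{B}$ of the partial knowledge is:
$$\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})=\ln\left(\frac{\Pr_{D\sim\theta}\left[\mathcal{M}(D)=O\mid D_i=a,B(D)=\hat{B}\right]}{\Pr_{D\sim\theta}\left[\mathcal{M}(D)=O\mid D_i=b,B(D)=\hat{B}\right]}\right)$$
if the three conditions are satisfied:
1. $\Pr_{D\sim\theta}[D_i=a,B(D)=\hat{B}]\neq0$ and $\Pr_{D\sim\theta}[D_i=b,B(D)=\hat{B}]\neq0$;
2. $\Pr_{D\sim\theta}[\mathcal{M}(D)=O\mid D_i=a,B(D)=\hat{B}]\neq0$;
3. $\Pr_{D\sim\theta}[\mathcal{M}(D)=O\mid D_i=b,B(D)=\hat{B}]\neq0$.
If condition 1 does not hold, then $\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})=0$. Else, if condition 2 does not hold, then $\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})=-\infty$. Else, if condition 3 does not hold, then $\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})=\infty$.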
The PLRV for partial knowledge captures the same idea as for classical differential privacy: it quantifies the “amount of information” gained by this attacker. When , it is exactly the same definition as the classical PLRV.
Note that the condition 1 above was not present in the original PLRV definition. What does this situation correspond to? It can happen for two distinct reasons.
$D_i$ can be neither $a$ nor $b$ when $B(D)=\hat{B}$. Comparing the possibilities $D_i=a$ and $D_i=b$ is pointless since they are both impossible.
When $B(D)=\hat{B}$, $D_i=a$ is possible, but not $D_i=b$ (or vice-versa). In this case, $\hat{B}$ is a sort of “distinguishing event”: the attacker can immediately distinguish between the two possibilities, using only their background knowledge. But the PLRV quantifies the information gain due to the mechanism. If the attacker no longer has any uncertainty, the mechanism can neither increase nor decrease their knowledge.
In both cases, the convention that the PLRV is $0$ is the only reasonable option.
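The conventions above can be spelled out on a small enumerable model. The sketch below is illustrative only: it assumes a finite distribution over databases given as a dictionary, a deterministic mechanism and a deterministic background knowledge function, and it assumes the conventions read as follows: the PLRV is 0 when one of the two hypotheses is impossible given the background knowledge, $-\infty$ when the output is impossible under $D_i=a$, and $+\infty$ when it is impossible only under $D_i=b$. All names and parameter values are hypothetical.

```python
import math
from itertools import product

def partial_knowledge_plrv(theta, mechanism, background, i, a, b, output, bk_value):
    """PLRV of (output, bk_value) when comparing D_i = a against D_i = b.

    theta maps each database (a tuple) to its probability; mechanism and
    background are deterministic functions of the database (an assumption).
    """
    def masses(value):
        # Pr[D_i = value, B(D) = bk_value], and the same with M(D) = output added.
        cond = sum(pr for db, pr in theta.items()
                   if db[i] == value and background(db) == bk_value)
        hit = sum(pr for db, pr in theta.items()
                  if db[i] == value and background(db) == bk_value
                  and mechanism(db) == output)
        return cond, hit

    cond_a, hit_a = masses(a)
    cond_b, hit_b = masses(b)
    if cond_a == 0 or cond_b == 0:
        return 0.0            # condition 1 fails: the background knowledge decides
    if hit_a == 0:
        return -math.inf      # condition 2 fails
    if hit_b == 0:
        return math.inf       # condition 3 fails
    return math.log((hit_a / cond_a) / (hit_b / cond_b))

# Hypothetical model: three i.i.d. binary records equal to 1 with probability 0.1,
# a noiseless count, and an attacker who knows the first record.
P_ONE = 0.1
theta = {db: math.prod(P_ONE if x else 1 - P_ONE for x in db)
         for db in product((0, 1), repeat=3)}

def first_record(db):
    return db[:1]

print(partial_knowledge_plrv(theta, sum, first_record, 2, 1, 0, 1, (0,)))
```

In this toy model, the attacker who knows that the first record is 0 and observes a count of 1 only has to decide which of the two remaining records is the 1, which gives a finite loss of $\ln 9$; a count of 2 rules out $D_2=0$ and yields an infinite loss.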
Now that we have translated the concept of PLRV from the classical DP context to the situation where the attacker only has partial knowledge, we can use this to adapt the privacy definition. The formula in Lemma 1 averages the PLRV over all possible outputs $O$, but the PLRV with partial knowledge has a second parameter, $\hat{B}$. How should we handle this new parameter? We show that there are at least two possibilities, both of which are sensible.
4.2 Active partial knowledge
The first option to handle partial knowledge is to quantify over all possibilities for the background knowledge of the attacker: we assume the worst, and consider the case where the attacker has the background knowledge with which the privacy loss will be the greatest. This models practical scenarios in which the attacker can not only see, but influence parts of the data. If the attacker can e.g. add fake users to the database, then they can choose the values associated to these new records, and they will maximize their chances of information gain. Therefore, we call this option active partial knowledge.
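Definition 6 (APK-DP).
Given a family of distributions $\Theta$ and a family of functions $\mathcal{B}$, a mechanism $\mathcal{M}$ is said to be $(\Theta,\mathcal{B},\varepsilon,\delta)$-APK-DP (Active Partial Knowledge Differential Privacy) if for all distributions $\theta\in\Theta$, all functions $B\in\mathcal{B}$, all indices $i$, all $a,b\in\mathcal{T}$, and all possible values $\hat{B}$ of the background knowledge:
$$\mathbb{E}_{O\sim\mathcal{M}(D)_{|D_i=a,B(D)=\hat{B}}}\left[\max\left(0,1-e^{\varepsilon-\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})}\right)\right]\le\delta.$$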
Quantifying over all possible values of the background knowledge is equivalent to letting the attacker choose the worst possible value. This assumption is realistic in certain cases: for example, if an online service releases statistics about their users, an attacker could create fake users and interact with the service, adding arbitrary data to the statistics. In this case, it might also be realistic to consider that some of the users are not fake; and as such, that the attacker only has a background knowledge limited to the fake users they control.
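Note that this definition admits an equivalent definition, using classical $(\varepsilon,\delta)$-indistinguishability.
Lemma 2.
A mechanism $\mathcal{M}$ is $(\Theta,\mathcal{B},\varepsilon,\delta)$-APK-DP if for all distributions $\theta\in\Theta$, all functions $B\in\mathcal{B}$, all indices $i$, all $a,b\in\mathcal{T}$, all possible values $\hat{B}$ of the background knowledge:
$$\mathcal{M}(D)_{|D_i=a,B(D)=\hat{B}}\approx_{\varepsilon,\delta}\mathcal{M}(D)_{|D_i=b,B(D)=\hat{B}}.$$
The proof is the same as in [MeMo_18:bucketing, Lemma 1]. ∎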
This formulation makes it explicit that APK-DP is the same as noiseless privacy [bhaskar2011noiseless], in its reformulation in [bassily2013coupled]. One important thing to note is that in this context, explicitly modeling the background knowledge by a family of functions is not necessary. We can simply condition each probability distribution $\theta\in\Theta$ on the value of each possible background knowledge function $B\in\mathcal{B}$, and obtain an equivalent definition.
A mechanism is -APK-DP if it is -APK-DP, where is the set of all distributions such that for each :
This follows immediately from Lemma 2. ∎
4.3 Passive partial knowledge
We saw before that the existing definitions of noiseless privacy implicitly assume an active attacker, who can influence the data. In certain contexts, this assumption is unrealistically strong: for example, consider a government census releasing some data about citizens. Physical surveys and identity verification might make it impossible for an attacker to inject fake data into the census. However, an attacker might use some auxiliary information to deduce the value of certain records in the data.
When the attacker has some partial information, but cannot influence the data, it no longer makes sense to quantify over all possible values of the background knowledge. In the same way that the reformulation of $(\varepsilon,\delta)$-DP using the PLRV averages the PLRV over all possible outputs, we need to average the PLRV over all possible values of the background knowledge.
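Definition 7 (PPK-DP).
Given a family of distributions $\Theta$ and a family of functions $\mathcal{B}$, a mechanism $\mathcal{M}$ is said to be $(\Theta,\mathcal{B},\varepsilon,\delta)$-PPK-DP if for all distributions $\theta\in\Theta$, all functions $B\in\mathcal{B}$, all indices $i$, and all $a,b\in\mathcal{T}$:
$$\mathbb{E}_{(O,\hat{B})\sim(\mathcal{M}(D),B(D))_{|D_i=a}}\left[\max\left(0,1-e^{\varepsilon-\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})}\right)\right]\le\delta.$$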
In this context, the term $\delta$ has a similar meaning as in $(\varepsilon,\delta)$-differential privacy. In $(\varepsilon,\delta)$-DP, the $\delta$ term captures the probability that the attacker is lucky, and that the mechanism returned a favorable output $O$. Here, “favorable” means that this output allows them to distinguish between two possible databases with high probability, or, equivalently, that the privacy loss associated to output $O$ is large. In $(\Theta,\mathcal{B},\varepsilon,\delta)$-PPK-DP however, the $\delta$ captures the probability that the attacker gets either a favorable output $O$, or a favorable background knowledge $\hat{B}$.
Note that there is no equivalent of Lemma 2 for PPK-DP. Since the statement of Lemma 2 conditions both probabilities on the value $\hat{B}$ of the background knowledge, using $(\varepsilon,\delta)$-indistinguishability only applies the $\delta$ to the randomness of $\mathcal{M}(D)$. To reformulate this definition using indistinguishability, we could, e.g., require that $\varepsilon$-indistinguishability holds with probability $1-\delta$ over the choice of $\hat{B}$.
4.4 Relation between definitions
What is the relation between PPK and APK? In this section, we show that as expected, an active attacker is stronger than a passive attacker: if a mechanism protects against an active attacker, then it also protects against a passive attacker. Then, we show that this distinction can only be strict when $\delta>0$, and we give a concrete example of a simple mechanism that protects against passive attackers, but not against active attackers.
Note that given $\varepsilon$, the $\delta$ of APK-DP bounds the probability mass of the privacy loss random variable for a fixed value $\hat{B}$ of the background knowledge, quantifying over all such values. PPK-DP, on the other hand, bounds the same probability mass, but averaged over all possible values of the background knowledge. This average is weighted according to the likelihood of each possible value of $\hat{B}$. We formalize this intuition in the lemma below.
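Given a distribution $\theta$, a mechanism $\mathcal{M}$, a function $B$, an index $i$, two values $a,b\in\mathcal{T}$ and $\varepsilon\ge0$, let us denote $m(O,\hat{B})=\max\left(0,1-e^{\varepsilon-\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})}\right)$. Consider the quantities bounded by the requirements of APK and PPK, respectively:
$$\delta^{\mathrm{APK}}(\hat{B})=\mathbb{E}_{O\sim\mathcal{M}(D)_{|D_i=a,B(D)=\hat{B}}}\left[m(O,\hat{B})\right]\quad\text{and}\quad\delta^{\mathrm{PPK}}=\mathbb{E}_{(O,\hat{B})\sim(\mathcal{M}(D),B(D))_{|D_i=a}}\left[m(O,\hat{B})\right].$$
Note that APK bounds $\delta^{\mathrm{APK}}(\hat{B})$ for all values of the background knowledge $\hat{B}$, while PPK bounds $\delta^{\mathrm{PPK}}$. Then:
$$\delta^{\mathrm{PPK}}=\sum_{\hat{B}}\Pr\left[B(D)=\hat{B}\mid D_i=a\right]\cdot\delta^{\mathrm{APK}}(\hat{B}).$$
The most natural way to compute $\delta^{\mathrm{PPK}}$ is to first generate $D\sim\theta_{|D_i=a}$, and then generate $O=\mathcal{M}(D)$ and $\hat{B}=B(D)$:
$$\delta^{\mathrm{PPK}}=\mathbb{E}_{D\sim\theta_{|D_i=a}}\left[\mathbb{E}_{O\sim\mathcal{M}(D),\,\hat{B}\sim B(D)}\left[m(O,\hat{B})\right]\right]$$
where the outer expectation is taken over $D$. The simplification comes from the fact that $O$ and $\hat{B}$ only depend on $D$, so fixing $D$ removes all randomness except the one inherent to $\mathcal{M}$ and $B$. More interestingly, $\delta^{\mathrm{PPK}}$ can also be rewritten by fixing the value of $\hat{B}$ first:
$$\delta^{\mathrm{PPK}}=\sum_{\hat{B}}\Pr\left[B(D)=\hat{B}\mid D_i=a\right]\cdot\mathbb{E}_{O\sim\mathcal{M}(D)_{|D_i=a,B(D)=\hat{B}}}\left[m(O,\hat{B})\right]$$
using the same simplification. The right-hand term is known:
$$\mathbb{E}_{O\sim\mathcal{M}(D)_{|D_i=a,B(D)=\hat{B}}}\left[m(O,\hat{B})\right]=\delta^{\mathrm{APK}}(\hat{B}).$$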
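The weighted-average relation between the passive and active quantities can be checked numerically on a small enumerable model. The sketch below is an illustration under assumed parameters, none of which come from the report: four i.i.d. binary records, a noiseless count released only when it is at least 3, an attacker who knows the first record, and $\varepsilon=\ln 2$. It computes the passive quantity twice, once directly from the joint law of output and background knowledge, and once as the average of the active quantities weighted by the probability of each background knowledge value.

```python
import math
from itertools import product

N = 4                # hypothetical database size
P_ONE = 0.2          # hypothetical probability that a record equals 1
EPS = math.log(2)    # hypothetical privacy parameter
TARGET, VALUE_A, VALUE_B = 3, 1, 0   # compare D_3 = 1 against D_3 = 0

THETA = {db: math.prod(P_ONE if x else 1 - P_ONE for x in db)
         for db in product((0, 1), repeat=N)}

def mechanism(db):
    """Noiseless thresholded count: release the count only if it is >= 3."""
    count = sum(db)
    return count if count >= 3 else 0

def background(db):
    """The attacker learns the first record."""
    return db[0]

def output_dist(value, bk):
    """Law of M(D) given D_TARGET = value and B(D) = bk (both events possible here)."""
    dist = {}
    for db, pr in THETA.items():
        if db[TARGET] == value and background(db) == bk:
            out = mechanism(db)
            dist[out] = dist.get(out, 0.0) + pr
    total = sum(dist.values())
    return {out: pr / total for out, pr in dist.items()}

def bounded_loss(p_a, p_b):
    """max(0, 1 - e^(EPS - L)) with L = ln(p_a / p_b), assuming p_a > 0."""
    if p_b == 0.0:
        return 1.0   # L is +infinity
    return max(0.0, 1.0 - math.exp(EPS) * p_b / p_a)

def delta_apk(bk):
    """Quantity bounded by APK-DP for one fixed background knowledge value."""
    dist_a, dist_b = output_dist(VALUE_A, bk), output_dist(VALUE_B, bk)
    return sum(pr * bounded_loss(pr, dist_b.get(out, 0.0))
               for out, pr in dist_a.items())

def mass_a():
    return sum(pr for db, pr in THETA.items() if db[TARGET] == VALUE_A)

def delta_ppk_direct():
    """Expectation over the joint law of (O, B_hat) given D_TARGET = VALUE_A."""
    total = 0.0
    for db, pr in THETA.items():
        if db[TARGET] != VALUE_A:
            continue
        bk, out = background(db), mechanism(db)
        p_a = output_dist(VALUE_A, bk)[out]
        p_b = output_dist(VALUE_B, bk).get(out, 0.0)
        total += pr / mass_a() * bounded_loss(p_a, p_b)
    return total

def delta_ppk_weighted():
    """Average of delta_apk weighted by Pr[B(D) = bk | D_TARGET = VALUE_A]."""
    total = 0.0
    for bk in (0, 1):
        weight = sum(pr for db, pr in THETA.items()
                     if db[TARGET] == VALUE_A and background(db) == bk) / mass_a()
        total += weight * delta_apk(bk)
    return total

print(delta_apk(0), delta_apk(1), delta_ppk_direct(), delta_ppk_weighted())
```

In this toy model the two ways of computing the passive quantity agree, and the result lies strictly between the active quantities for the two possible background knowledge values: the attacker who happens to know a 1 is closer to the threshold and learns more.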
This interpretation allows us to deduce natural properties of active vs. passive background knowledge DP. First, unsurprisingly, an active attacker is stronger than a passive one: APK-DP is stronger than PPK-DP.
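Given a family of distributions $\Theta$ and a family of functions $\mathcal{B}$, if a mechanism is $(\Theta,\mathcal{B},\varepsilon,\delta)$-APK-DP, then it is also $(\Theta,\mathcal{B},\varepsilon,\delta)$-PPK-DP.
If the mechanism is $(\Theta,\mathcal{B},\varepsilon,\delta)$-APK-DP, then for all $\theta\in\Theta$, $B\in\mathcal{B}$, $i$, $a,b\in\mathcal{T}$, and $\hat{B}$, $\delta^{\mathrm{APK}}(\hat{B})\le\delta$, where $\delta^{\mathrm{APK}}(\hat{B})$ and $\delta^{\mathrm{PPK}}$ denote the quantities bounded by APK-DP and PPK-DP respectively. So:
$$\delta^{\mathrm{PPK}}=\sum_{\hat{B}}\Pr\left[B(D)=\hat{B}\mid D_i=a\right]\cdot\delta^{\mathrm{APK}}(\hat{B})\le\sum_{\hat{B}}\Pr\left[B(D)=\hat{B}\mid D_i=a\right]\cdot\delta=\delta.$$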
This implication is strict: the privacy loss random variable can be arbitrarily higher for certain values of the background knowledge. Thus, quantifying over all possible values can lead to much larger values of $\varepsilon$ and $\delta$ than averaging over all possible values of the background knowledge. We illustrate this phenomenon in the following example.
Example 1 (Thresholding).
1000 people take part in a Yes/No-referendum, independently from each other. Each person votes “Yes” with the same probability and this probability is very low, we set it to . The mechanism for this example counts the number of “Yes”-votes but returns it only if it is above 100. Otherwise it returns 0. Furthermore, let $B$ be the function that simply returns the votes of the first 100 participants ($B(D)=(D_1,\dots,D_{100})$), and $\mathcal{B}=\{B\}$. We assume that the attacker wants to know the vote of an unknown individual $i$, for $i>100$.
If the attacker is a passive one, this example is private. Indeed, almost certainly, the mechanism outputs 0 and therefore gives no information: the background knowledge is also going to consist mainly of “No” votes and even if there are “Yes” votes in the first 100 participants’ votes, they are highly unlikely to be more than a few. An active attacker, on the other hand, can simply add many fake “Yes” votes to the database, to reach the threshold. The mechanism then becomes a simple counting mechanism which does not provide privacy: with high probability, everybody voted “No”, and the only uncertainty left is over the attacker’s target.
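The gap between the two attackers in this example can be quantified with a short computation. The following sketch rests on several assumptions that are not taken from the report: the per-voter “Yes” probability is set to the hypothetical value $10^{-4}$, $\varepsilon$ is set to 1, the background knowledge is summarized by the number of “Yes” votes among the first 100 participants (which is all the count-based mechanism depends on), and only the direction “target voted Yes” versus “target voted No” is evaluated. For a fixed background knowledge, it computes the sum over outputs of $\max(0,\Pr[O\mid\text{Yes}]-e^{\varepsilon}\Pr[O\mid\text{No}])$; for the passive attacker, it averages this quantity over the distribution of the background knowledge.

```python
import math

N, KNOWN, THRESHOLD = 1000, 100, 100
P_YES = 1e-4              # hypothetical value, chosen only for illustration
EPS = 1.0                 # hypothetical privacy parameter
REST = N - KNOWN - 1      # voters that are neither known nor the target

def binom_pmf(n, k, p):
    """Binomial probability computed in log space to avoid overflow."""
    if k < 0 or k > n:
        return 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def output_dist(known_yes, target_vote):
    """Law of the released value given the known votes and the target's vote."""
    dist = {}
    for rest_yes in range(REST + 1):
        total = known_yes + target_vote + rest_yes
        out = total if total > THRESHOLD else 0
        dist[out] = dist.get(out, 0.0) + binom_pmf(REST, rest_yes, P_YES)
    return dist

def delta_active(known_yes, eps):
    """Quantity bounded by the active definition for one background knowledge value."""
    given_yes = output_dist(known_yes, 1)
    given_no = output_dist(known_yes, 0)
    return sum(max(0.0, pr - math.exp(eps) * given_no.get(out, 0.0))
               for out, pr in given_yes.items())

def delta_passive(eps):
    """Average of delta_active over the law of the number of known Yes votes."""
    return sum(binom_pmf(KNOWN, k, P_YES) * delta_active(k, eps)
               for k in range(KNOWN + 1))

print(delta_active(KNOWN, EPS))   # attacker controls 100 Yes votes
print(delta_active(0, EPS))       # attacker only sees No votes
print(delta_passive(EPS))
```

Under these assumptions, an attacker who controls 100 “Yes” votes gets a value above 0.7, whereas the average over randomly drawn background knowledge is negligible, because reaching the threshold by chance is astronomically unlikely.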
Second, the only difference between APK-DP and PPK-DP is in the interpretation of the probability mass $\delta$. Without the $\delta$, $(\Theta,\mathcal{B},\varepsilon,0)$-APK-DP and $(\Theta,\mathcal{B},\varepsilon,0)$-PPK-DP are both worst-case properties, like $\varepsilon$-DP. The ability for an attacker to choose the background knowledge does not matter, since even for the passive attacker, we consider the worst possible output $O$ and background knowledge $\hat{B}$. Thus, if $\delta=0$, APK-DP and PPK-DP are equivalent.
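Given a family of distributions $\Theta$ and a family of functions $\mathcal{B}$, if a mechanism is $(\Theta,\mathcal{B},\varepsilon,0)$-PPK-DP, then it is also $(\Theta,\mathcal{B},\varepsilon,0)$-APK-DP. Furthermore, both definitions are equivalent to the statement: for all $\theta\in\Theta$, $B\in\mathcal{B}$, $i$, and $a,b\in\mathcal{T}$:
$$\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})\le\varepsilon\quad\text{for all }O\text{ and }\hat{B}.$$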
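By Lemma 5, we know that
$$\sum_{\hat{B}}\Pr\left[B(D)=\hat{B}\mid D_i=a\right]\cdot\delta^{\mathrm{APK}}(\hat{B})=\delta^{\mathrm{PPK}}=0$$
holds, where $\delta^{\mathrm{APK}}(\hat{B})$ and $\delta^{\mathrm{PPK}}$ denote the quantities bounded by APK-DP and PPK-DP respectively. All terms are nonnegative; hence, the sum can only be $0$ if all terms are $0$. Hence, for all $\hat{B}$ such that $\Pr[B(D)=\hat{B}\mid D_i=a]>0$,
$$\delta^{\mathrm{APK}}(\hat{B})=0$$
holds. Moreover, for all such $\hat{B}$,
$$\delta^{\mathrm{APK}}(\hat{B})=\sum_{O}\Pr\left[\mathcal{M}(D)=O\mid D_i=a,B(D)=\hat{B}\right]\cdot\max\left(0,1-e^{\varepsilon-\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})}\right)$$
holds, and again all terms are nonnegative. As $\delta^{\mathrm{APK}}(\hat{B})=0$, we know that for all $O$ such that $\Pr[\mathcal{M}(D)=O\mid D_i=a,B(D)=\hat{B}]>0$:
$$\mathcal{L}^{\theta,B}_{D_i=a/D_i=b}(O,\hat{B})\le\varepsilon.$$
For all remaining pairs $(O,\hat{B})$, the PLRV is either $0$ or $-\infty$ by definition, so the bound holds for all $O$ and $\hat{B}$. As the precondition holds for all $a,b\in\mathcal{T}$, the same statement holds for $\mathcal{L}^{\theta,B}_{D_i=b/D_i=a}$. Conversely, if the bound holds for all $O$ and $\hat{B}$, then $\delta^{\mathrm{APK}}(\hat{B})=0$ for all $\hat{B}$, so the mechanism is $(\Theta,\mathcal{B},\varepsilon,0)$-APK-DP. ∎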
We showed that when modeling an attacker with limited background knowledge, there were two options to consider, depending on whether the attacker has the ability to influence the data. We proposed two variants of noiseless privacy that make this distinction explicit: passive partial knowledge differential privacy (PPK-DP) and active partial knowledge differential privacy (APK-DP). Our privacy notions clearly separate the case where the attacker is able to influence some of the data (that is then used as background knowledge), and the case where they receive some random background knowledge without being able to influence it.
A natural direction for future work is the analysis of various natural mechanisms under PPK-DP and APK-DP; in particular of mechanisms for which the two definitions are not equivalent. Further, as our new notions are formulated with the privacy loss random variable (PLRV), our approach can be extended to other recently introduced privacy notions, such as Rényi Differential Privacy or Concentrated Differential Privacy: we leave the extension of existing results using these variants to future work.