MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples
Abstract.
In a membership inference attack, an attacker aims to infer whether a data sample is in a target classifier’s training dataset or not. Specifically, given black-box access to the target classifier, the attacker trains a binary classifier, which takes a data sample’s confidence score vector predicted by the target classifier as an input and predicts the data sample to be a member or nonmember of the target classifier’s training dataset. Membership inference attacks pose severe privacy and security threats to the training dataset. Most existing defenses leverage differential privacy when training the target classifier or regularize the training process of the target classifier. These defenses suffer from two key limitations: 1) they do not have formal utility-loss guarantees of the confidence score vectors, and 2) they achieve suboptimal privacy-utility tradeoffs.
In this work, we propose MemGuard, the first defense with formal utility-loss guarantees against black-box membership inference attacks. Instead of tampering with the training process of the target classifier, MemGuard adds noise to each confidence score vector predicted by the target classifier. Our key observation is that the attacker uses a classifier to predict member or nonmember, and classifiers are vulnerable to adversarial examples. Based on this observation, we propose to add a carefully crafted noise vector to a confidence score vector to turn it into an adversarial example that misleads the attacker’s classifier. Specifically, MemGuard works in two phases. In Phase I, MemGuard finds a carefully crafted noise vector that can turn a confidence score vector into an adversarial example, which is likely to mislead the attacker’s classifier into guessing randomly between member and nonmember. We find such a carefully crafted noise vector via a new method that we design to incorporate the unique utility-loss constraints on the noise vector. In Phase II, MemGuard adds the noise vector to the confidence score vector with a certain probability, which is selected to satisfy a given utility-loss budget on the confidence score vector. Our experimental results on three datasets show that MemGuard can effectively defend against membership inference attacks and achieve better privacy-utility tradeoffs than existing defenses. Our work is the first to show that adversarial examples can be used as defensive mechanisms against membership inference attacks.
1. Introduction
Machine learning (ML) is transforming many aspects of our society. We consider a model provider that deploys an ML classifier (called the target classifier) as black-box software or a service, which returns a confidence score vector for a query data sample from a user. The confidence score vector is a probability distribution over the possible labels, and the label of the query data sample is predicted as the one that has the largest confidence score. Multiple studies have shown that such a black-box ML classifier is vulnerable to membership inference attacks (Shokri et al., 2017; Nasr et al., 2019; Salem et al., 2019b; Song et al., 2019). Specifically, an attacker trains a binary classifier, which takes a data sample’s confidence score vector predicted by the target classifier as an input and predicts whether the data sample is a member or nonmember of the target classifier’s training dataset. Membership inference attacks pose severe privacy and security threats to ML. In particular, in application scenarios where the training dataset is sensitive (e.g., biomedical records and location traces), successful membership inference leads to severe privacy violations. For instance, if an attacker knows her victim’s data is used to train a medical diagnosis classifier, then the attacker can directly infer the victim’s health status. Beyond privacy, membership inference also damages the model provider’s intellectual property of the training dataset, as collecting and labeling the training dataset may require substantial resources.
Therefore, defending against membership inference attacks is an urgent research problem, and multiple defenses (Shokri et al., 2017; Nasr et al., 2018; Salem et al., 2019b) have been explored. A major reason why membership inference attacks succeed is that the target classifier is overfitted. As a result, the confidence score vectors predicted by the target classifier are distinguishable for members and nonmembers of the training dataset. Therefore, state-of-the-art defenses (Shokri et al., 2017; Nasr et al., 2018; Salem et al., 2019b) essentially regularize the training process of the target classifier to reduce overfitting and the gaps of the confidence score vectors between members and nonmembers of the training dataset. For instance, $L_2$ regularization (Shokri et al., 2017), min-max game based adversarial regularization (Nasr et al., 2018), and dropout (Salem et al., 2019b) have been explored to regularize the target classifier. Another line of defenses (Chaudhuri et al., 2011; Kifer et al., 2012; Iyengar et al., 2019; Song et al., 2013; Bassily et al., 2014; Wang et al., 2017; Abadi et al., 2016; Yu et al., 2019) leverages differential privacy (Dwork et al., 2006) when training the target classifier. Since tampering with the training process provides no guarantees on the confidence score vectors, these defenses have no formal utility-loss guarantees on the confidence score vectors. Moreover, these defenses achieve suboptimal tradeoffs between the membership privacy of the training dataset and the utility loss of the confidence score vectors. For instance, Jayaraman and Evans (Jayaraman and Evans, 2019) found that existing differentially private machine learning methods rarely offer acceptable privacy-utility tradeoffs for complex models.
Our work: In this work, we propose MemGuard, the first defense with formal utility-loss guarantees against membership inference attacks under the black-box setting. Instead of tampering with the training process of the target classifier, MemGuard randomly adds noise to the confidence score vector predicted by the target classifier for any query data sample. MemGuard can be applied to an existing target classifier without retraining it. Given a query data sample’s confidence score vector, MemGuard aims to achieve two goals: 1) the attacker’s classifier is inaccurate at inferring member or nonmember for the query data sample after noise is added to the confidence score vector, and 2) the utility loss of the confidence score vector is bounded. Specifically, the noise should not change the predicted label of the query data sample, since even 1% loss of the label accuracy may be intolerable in some critical applications such as finance and healthcare. Moreover, the confidence score distortion introduced by the noise should be bounded by a budget, since a confidence score vector is intended to give a user more information beyond the predicted label. We formulate achieving the two goals as solving an optimization problem. However, it is computationally challenging to solve the optimization problem because the noise space is large. To address the challenge, we propose a two-phase framework to approximately solve the problem.
We observe that an attacker uses an ML classifier to predict member or nonmember, and that classifiers can be misled by adversarial examples (Carlini and Wagner, 2017; Papernot et al., 2016a, 2017; Szegedy et al., 2013; Papernot et al., 2018, 2016c; Kurakin et al., 2016; Goodfellow et al., 2015). Therefore, in Phase I, MemGuard finds a carefully crafted noise vector that can turn the confidence score vector into an adversarial example. Specifically, MemGuard aims to find a noise vector such that the attacker’s classifier is likely to guess randomly when inferring member or nonmember based on the noisy confidence score vector. Since the defender does not know the attacker’s classifier, as there are many choices, the defender itself trains a classifier for membership inference and crafts the noise vector based on its own classifier. Due to the transferability (Szegedy et al., 2013; Kurakin et al., 2016; Liu et al., 2016; Papernot et al., 2016c) of adversarial examples, the noise vector that misleads the defender’s classifier is likely to also mislead the attacker’s classifier. The adversarial machine learning community has developed many algorithms (e.g., (Carlini and Wagner, 2017; Papernot et al., 2016a; Goodfellow et al., 2015; Madry et al., 2018; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016; Tramèr et al., 2017; Moosavi-Dezfooli et al., 2017)) to find adversarial noise/examples. However, these algorithms are insufficient for our problem because they do not consider the unique constraints on the utility loss of the confidence score vector. Specifically, the noisy confidence score vector should not change the predicted label of the query data sample and should still be a probability distribution. To address this challenge, we design a new algorithm to find a small noise vector that satisfies the utility-loss constraints.
In Phase II, MemGuard adds the noise vector found in Phase I to the true confidence score vector with a certain probability. The probability is selected such that the expected confidence score distortion is bounded by the budget and the defender’s classifier is most likely to guess randomly when inferring member or nonmember. Formally, we formulate finding this probability as solving an optimization problem and derive an analytical solution for the optimization problem.
We evaluate MemGuard and compare it with state-of-the-art defenses (Shokri et al., 2017; Nasr et al., 2018; Salem et al., 2019b; Abadi et al., 2016) on three real-world datasets. Our empirical results show that MemGuard can effectively defend against state-of-the-art black-box membership inference attacks (Nasr et al., 2019; Salem et al., 2019b). In particular, as MemGuard is allowed to add larger noise (we measure the magnitude of the noise using its $L_1$-norm), the inference accuracies of all evaluated membership inference attacks become smaller. Moreover, MemGuard achieves better privacy-utility tradeoffs than state-of-the-art defenses. Specifically, given the same average confidence score distortion, MemGuard reduces the attacker’s accuracy at inferring members/nonmembers by the most.
In summary, our key contributions are as follows:

We propose MemGuard, the first defense with formal utility-loss guarantees against membership inference attacks under the black-box setting.

We propose a new algorithm to find a noise vector that satisfies the unique utility-loss constraints in Phase I of MemGuard. Moreover, in Phase II, we derive an analytical solution of the probability with which MemGuard adds the noise vector to the confidence score vector.

We evaluate MemGuard on three real-world datasets. Our results show that MemGuard is effective and outperforms existing defenses.
2. Related Work
2.1. Membership Inference
Membership inference attacks: The goal of membership inference is to determine whether a certain data sample is inside a dataset. Homer et al. (Homer et al., 2008) proposed the first membership inference attack in the biomedical setting, in particular on genomic data. Specifically, they showed that an attacker can compare a user’s genomic data with the summary statistics of the target database, such as mean and standard deviation, to determine the presence of the user in the database. The comparison can be done by using statistical testing methods such as the log-likelihood ratio test. Later, several works performed similar membership inference attacks against other types of biomedical data such as MicroRNA (Backes et al., 2016) and DNA methylation (Hagestedt et al., 2019). Recently, Pyrgelis et al. (Pyrgelis et al., 2018, 2019) further showed that membership inference can also be performed effectively against location databases. In particular, they showed that an attacker can infer whether a user’s location dataset was used for computing a given aggregate location dataset.
Membership inference attacks against ML models: Shokri et al. (Shokri et al., 2017) introduced membership inference in the ML setting. The goal here is to determine whether a data sample is in the training dataset of a target black-box ML classifier. To achieve the goal, the attacker trains binary ML classifiers, which take a data sample’s confidence score vector predicted by the target classifier as input and infer the data sample to be a member or nonmember of the target classifier’s training dataset. We call these classifiers attack classifiers; they are trained using shadow classifiers. Specifically, the attacker is assumed to have a dataset coming from the same distribution as the target classifier’s training dataset, and the attacker uses the dataset to train shadow classifiers, each of which aims to replicate the target classifier. Then, the attacker trains the attack classifiers by using the confidence score vectors predicted by the shadow classifiers for some members and nonmembers of the shadow classifiers’ training datasets.
Salem et al. (Salem et al., 2019b) recently proposed new membership inference attacks for black-box target classifiers, which relax the assumptions of the attacks proposed by Shokri et al. from both model and data angles. For instance, they showed that the attacker can rank the entries in a confidence score vector before feeding it into an attack classifier, which improves the attack effectiveness. Moreover, they showed that it is sufficient for the attacker to train just one shadow classifier. These results indicate that the membership inference threat is even larger than previously thought.
More recently, Nasr et al. (Nasr et al., 2019) proposed membership inference attacks against white-box ML models. For a data sample, they calculate the corresponding gradients over the white-box target classifier’s parameters and use these gradients as the data sample’s feature for membership inference. Moreover, both Nasr et al. (Nasr et al., 2019) and Melis et al. (Melis et al., 2019) proposed membership inference attacks against federated learning. While most of the previous works concentrated on classification models (Shokri et al., 2017; Long et al., 2017, 2018; Nasr et al., 2018; Yeom et al., 2018; Salem et al., 2019b; Nasr et al., 2019), Hayes et al. (Hayes et al., 2019) studied membership inference against generative models, in particular generative adversarial networks (GANs) (Goodfellow et al., 2014). They designed attacks for both white-box and black-box settings. Their results showed that generative models are also vulnerable to membership inference.
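As a minimal sketch of the shadow-training idea described above, the following code builds a labeled training set for an attack classifier from a single shadow classifier. The function names, the `rank_scores` flag, and the toy shadow classifier are hypothetical illustrations, not the implementation of Shokri et al. or Salem et al.; sorting the scores in descending order is assumed here as one way to rank the entries.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def build_attack_dataset(
    shadow_predict: Callable[[Sequence[float]], Vector],
    shadow_members: Sequence[Sequence[float]],
    shadow_nonmembers: Sequence[Sequence[float]],
    rank_scores: bool = True,
) -> List[Tuple[Vector, int]]:
    """Pair each shadow confidence score vector with a label: 1 = member, 0 = nonmember."""
    dataset: List[Tuple[Vector, int]] = []
    for label, samples in ((1, shadow_members), (0, shadow_nonmembers)):
        for x in samples:
            scores = list(shadow_predict(x))
            if rank_scores:
                # Assumed ranking step: sort the scores from largest to smallest.
                scores.sort(reverse=True)
            dataset.append((scores, label))
    return dataset


def toy_shadow_predict(x: Sequence[float]) -> Vector:
    """Hypothetical two-class shadow classifier used only for illustration."""
    p = min(max(x[0], 0.0), 1.0)
    return [1.0 - p, p]


# One member and one nonmember of the (toy) shadow training set.
attack_data = build_attack_dataset(toy_shadow_predict, [[0.875]], [[0.625]])
```

The resulting pairs can be fed to any binary classifier; the attack classifier itself is deliberately left out of the sketch.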
Defense mechanisms against membership inference: Multiple defense mechanisms have been proposed to mitigate the threat of membership inference in the ML setting. We summarize them as follows.
$L_2$-Regularizer (Shokri et al., 2017). Overfitting, i.e., ML classifiers being more confident on data samples they were trained on (members) than on others, is one major reason why membership inference is effective. Therefore, to defend against membership inference, researchers have explored reducing overfitting using regularization. For instance, Shokri et al. (Shokri et al., 2017) explored using a conventional $L_2$ regularizer when training the target classifier.
Min-Max Game (Nasr et al., 2018). Nasr et al. (Nasr et al., 2018) proposed a min-max game-theoretic method to train a target classifier. Specifically, the method formulates a min-max optimization problem that aims to minimize the target classifier’s prediction loss while maximizing the membership privacy. This formulation is equivalent to adding a new regularization term, called adversarial regularization, to the loss function of the target classifier.
Dropout (Salem et al., 2019b). Dropout is a recently proposed technique to regularize neural networks (Srivastava et al., 2014). Salem et al. (Salem et al., 2019b) explored using dropout to mitigate membership inference attacks. Roughly speaking, dropout drops a neuron with a certain probability in each iteration of training a neural network.
Model Stacking (Salem et al., 2019b). Model stacking is a classical ensemble method which combines multiple weak classifiers’ results into a strong one. Salem et al. (Salem et al., 2019b) explored using model stacking to mitigate membership inference attacks. Specifically, the target classifier consists of three classifiers organized into a two-level tree structure. The first two classifiers on the bottom of the tree take the original data samples as input, while the third one’s input is the outputs of the first two classifiers. The three classifiers are trained using disjoint sets of data samples, which reduces the chance for the target classifier to remember any specific data sample, thus preventing overfitting.
Differential privacy. Differential privacy (Dwork et al., 2006) is a classical method for privacy-preserving machine learning. Most differential privacy based defenses add noise to the objective function that is used to learn a model (Chaudhuri et al., 2011; Kifer et al., 2012; Iyengar et al., 2019), or to the gradient in each iteration of gradient descent or stochastic gradient descent that is used to minimize the objective function (Song et al., 2013; Bassily et al., 2014; Wang et al., 2017; Abadi et al., 2016; Yu et al., 2019). Shokri and Shmatikov (Shokri and Shmatikov, 2015) designed a differential privacy method for collaborative learning of deep neural networks.
Limitations. Existing defenses suffer from two key limitations: 1) they do not have formal utility-loss guarantees of the confidence score vector; and 2) they achieve suboptimal privacy-utility tradeoffs. Our defense addresses these two limitations. For instance, as we will show in experiments, with the same utility loss of the confidence score vector (e.g., the same $L_1$-norm distortion of the confidence score vector), our defense reduces the attack classifier’s accuracy at inferring members/nonmembers to a larger extent than existing defenses.
Other privacy/confidentiality attacks against ML: There exist multiple other types of privacy/confidentiality attacks against ML models (Fredrikson et al., 2014, 2015; Ateniese et al., 2013; Ganju et al., 2018; Melis et al., 2019; Tramèr et al., 2016; Wang and Gong, 2018; Oh et al., 2018; Salem et al., 2019a). Fredrikson et al. (Fredrikson et al., 2014, 2015) proposed model inversion attacks. For instance, they can infer the missing values of an input feature vector by leveraging a classifier’s prediction on the input feature vector. Several works (Ateniese et al., 2013; Ganju et al., 2018; Melis et al., 2019) studied property inference attacks, which aim to infer a certain property (e.g., the fraction of male and female users) of a target classifier’s training dataset. Tramèr et al. (Tramèr et al., 2016) proposed model stealing attacks. They designed different techniques tailored to different ML models, aiming at stealing the parameters of the target models. Another line of works studied hyperparameter stealing attacks (Wang and Gong, 2018; Oh et al., 2018), which aim to steal hyperparameters such as the neural network architecture and the hyperparameter that balances between the loss function and the regularization term.
2.2. Adversarial Examples
Given a classifier and an example, we can add carefully crafted noise to the example such that the classifier predicts its label as we desire. The example with carefully crafted noise is called an adversarial example. Our MemGuard adds carefully crafted noise to a confidence score vector to turn it into an adversarial example, which is likely to mislead the attack classifier into guessing randomly between member and nonmember. The adversarial machine learning community has developed many algorithms (e.g., (Carlini and Wagner, 2017; Papernot et al., 2016a; Goodfellow et al., 2015; Madry et al., 2018; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016; Tramèr et al., 2017; Moosavi-Dezfooli et al., 2017)) to find adversarial examples. However, these algorithms are insufficient for our problem because they do not consider the utility-loss constraints on the confidence score vectors. We address this challenge by designing a new algorithm to find adversarial examples.
Since our defense leverages adversarial examples to mislead the attacker’s attack classifier, an adaptive attacker can leverage a classifier that is more robust against adversarial examples as the attack classifier. Although different methods (e.g., adversarial training (Goodfellow et al., 2015; Tramèr et al., 2017; Madry et al., 2018), defensive distillation (Papernot et al., 2016b), Region-based Classification (Cao and Gong, 2017), MagNet (Meng and Chen, 2017), and Feature Squeezing (Xu et al., 2018)) have been explored to make classifiers robust against adversarial examples, it is still considered an open challenge to design such robust classifiers. Nevertheless, in our experiments, we will consider an attacker that uses adversarial training to train its attack classifier, as adversarial training was considered to be the most empirically robust method against adversarial examples so far (Athalye et al., 2018).
3. Problem Formulation
In our problem formulation, we have three parties, i.e., model provider, attacker, and defender. Table 1 shows some important notations used in this paper.
Table 1. Important notations.
Notation | Description
$\mathbf{x}$ | A data sample
$\mathbf{s}$ | A true confidence score vector
$\mathbf{s}'$ | A noisy confidence score vector
$\mathbf{n}$ | A noise vector
$f$ | Decision function of the target classifier
$\mathbf{z}$ | Logits of the target classifier
$C$ | Attacker’s attack classifier for membership inference
$g$ | Decision function of defender’s defense classifier
$h$ | Logits of the defender’s defense classifier
$\mathcal{M}$ | Randomized noise addition mechanism
$\epsilon$ | Confidence score distortion budget
3.1. Model Provider
We assume a model provider has a proprietary training dataset (e.g., healthcare dataset, location dataset). The model provider trains a machine learning classifier using the proprietary training dataset. Then, the model provider deploys the classifier as a cloud service or a client-side AI software product (e.g., a mobile or IoT app), so other users can leverage the classifier to make predictions for their own data samples. In particular, we consider that the deployed classifier returns a confidence score vector for a query data sample. Formally, we have:
$$\mathbf{s} = f(\mathbf{x}),$$
where $f$, $\mathbf{x}$, and $\mathbf{s}$ represent the classifier’s decision function, the query data sample, and the confidence score vector, respectively. The confidence score vector essentially is the predicted posterior probability distribution of the label of the query data sample, i.e., $s_j$ is the predicted posterior probability that the query data sample has label $j$. The label of the query data sample is predicted to be the one that has the largest confidence score, i.e., the label is predicted as $\arg\max_j \{s_j\}$. For convenience, we call the model provider’s classifier the target classifier. Moreover, we assume the target classifier is a neural network in this work.
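To make the relationship between logits, confidence scores, and the predicted label concrete, here is a small sketch. It assumes, as is standard for neural network classifiers but not stated explicitly above, that the confidence score vector is the softmax of the target classifier’s logits; the function names and the example logits are illustrative only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Map logits to a probability distribution (assumed output layer)."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_label(scores: Sequence[float]) -> int:
    """Return the index of the largest confidence score."""
    return max(range(len(scores)), key=lambda j: scores[j])


logits = [2.0, 0.5, -1.0]        # hypothetical logits for a 3-class problem
scores = softmax(logits)         # confidence score vector
label = predicted_label(scores)  # label with the largest confidence score
```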
3.2. Attacker
An attacker aims to infer the proprietary training dataset of the model provider. Specifically, we consider that the attacker only has black-box access to the target classifier, i.e., the attacker can send query data samples to the target classifier and obtain their confidence score vectors predicted by the target classifier. The attacker leverages black-box membership inference attacks (Long et al., 2018; Nasr et al., 2018; Shokri et al., 2017; Salem et al., 2019b) to infer the members of the target classifier’s training dataset. Roughly speaking, in membership inference attacks, the attacker trains a binary classifier, which takes a query data sample’s confidence score vector as input and predicts whether the query data sample is in the target classifier’s training dataset or not. Formally, we have:
$$C: \mathbf{s} \rightarrow \{0, 1\},$$
where $C$ is the attacker’s binary classifier, $\mathbf{s}$ is the confidence score vector predicted by the target classifier for the query data sample $\mathbf{x}$, 0 indicates that the query data sample is not a member of the target classifier’s training dataset, and 1 indicates that the query data sample is a member of the target classifier’s training dataset. For convenience, we call the attacker’s binary classifier the attack classifier. We will discuss more details about how the attacker could train its attack classifier in Section 5. Note that, to consider strong attacks, we assume the attacker knows our defense mechanism, but the defender does not know the attack classifier since the attacker has many choices for the attack classifier.
3.3. Defender
The defender aims to defend against black-box membership inference attacks. The defender could be the model provider itself or a trusted third party. For any query data sample from any user, the target classifier predicts its confidence score vector and the defender adds a noise vector to the confidence score vector before returning it to the user. Formally, we have:
$$\mathbf{s}' = \mathbf{s} + \mathbf{n},$$
where $\mathbf{s}$ is the true confidence score vector predicted by the target classifier for a query data sample, $\mathbf{n}$ is the noise vector added by the defender, and $\mathbf{s}'$ is the noisy confidence score vector that is returned to a user. Therefore, an attacker only has access to the noisy confidence score vectors. The defender aims to add noise to achieve the following two goals:

Goal I. The attacker’s attack classifier is inaccurate at inferring the members/nonmembers of the target classifier’s training dataset, i.e., protecting the privacy of the training dataset.

Goal II. The utility loss of the confidence score vector is bounded.
However, achieving these two goals faces several challenges which we discuss next.
Achieving Goal I: The first challenge to achieve Goal I is that the defender does not know the attacker’s attack classifier. To address the challenge, the defender itself trains a binary classifier to perform membership inference and adds noise vectors to the confidence score vectors such that its own classifier is inaccurate at inferring members/nonmembers. In particular, the defender’s classifier takes a confidence score vector as input and predicts member or nonmember for the corresponding data sample. We call the defender’s binary classifier the defense classifier and denote its decision function as $g$. Moreover, we consider that the decision function $g(\mathbf{s})$ represents the probability that the corresponding data sample, whose confidence score vector predicted by the target classifier is $\mathbf{s}$, is a member of the target classifier’s training dataset. In particular, we consider that the defender trains a neural network classifier whose output layer has one neuron with a sigmoid activation function. For such a classifier, the decision function’s output (i.e., the output of the neuron in the output layer) represents the probability of being a member. Formally, we have:
$$g: \mathbf{s} \rightarrow [0, 1].$$
The defense classifier predicts a data sample to be a member of the target classifier’s training dataset if and only if $g(\mathbf{s}) > 0.5$.
To make the defense classifier inaccurate, one method is to add a noise vector to a true confidence score vector such that the defense classifier makes an incorrect prediction. Specifically, if the defense classifier predicts member (or nonmember) for the true confidence score vector, then the defender adds a noise vector such that the defense classifier predicts nonmember (or member) for the noisy confidence score vector. However, when an attacker knows the defense mechanism, the attacker can easily adapt its attack to achieve a high accuracy. In particular, the attacker predicts member (or nonmember) when its attack classifier predicts nonmember (or member) for a data sample. Another method is to add noise vectors such that the defense classifier always predicts member (or nonmember) for the noisy confidence score vectors. However, for some true confidence score vectors, such a method may need noise that violates the utility-loss constraints of the confidence score vectors (we will discuss utility-loss constraints later in this section).
Randomized noise addition mechanism. Therefore, we consider that the defender adopts a randomized noise addition mechanism denoted as $\mathcal{M}$. Specifically, given a true confidence score vector $\mathbf{s}$, the defender samples a noise vector $\mathbf{n}$ from the space of possible noise vectors with a probability $\mathcal{M}(\mathbf{n}|\mathbf{s})$ and adds it to the true confidence score vector. Since random noise is added to a true confidence score vector, the decision function $g$ outputs a random probability of being a member. We consider that the defender’s goal is to make the expectation of the probability of being a member predicted by $g$ close to 0.5. In other words, the defender’s goal is to add random noise such that the defense classifier randomly guesses member or nonmember for a data sample on average. Formally, the defender aims to find a randomized noise addition mechanism $\mathcal{M}$ such that $|E_{\mathcal{M}}(g(\mathbf{s}+\mathbf{n})) - 0.5|$ is minimized.
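The quantity the defender wants to minimize can be illustrated with a short sketch. It evaluates the gap between 0.5 and the expected membership probability when the mechanism is a finite list of (probability, noise vector) pairs. The finite support, the toy sigmoid-based defense classifier, and all names below are assumptions for illustration; the mechanism in the text is defined over a continuous noise space.

```python
import math
from typing import Callable, List, Sequence, Tuple

Mechanism = List[Tuple[float, Sequence[float]]]  # (probability, noise vector)


def toy_defense_classifier(scores: Sequence[float]) -> float:
    """Hypothetical defense classifier: sigmoid of the largest score."""
    return 1.0 / (1.0 + math.exp(-8.0 * (max(scores) - 0.75)))


def objective(
    g: Callable[[Sequence[float]], float],
    s: Sequence[float],
    mechanism: Mechanism,
) -> float:
    """Distance between the expected membership probability and 0.5."""
    expected = 0.0
    for prob, noise in mechanism:
        noisy = [sj + nj for sj, nj in zip(s, noise)]
        expected += prob * g(noisy)
    return abs(expected - 0.5)


s = [0.875, 0.125]       # true confidence score vector (toy values)
zero = [0.0, 0.0]        # adding no noise
n = [-0.125, 0.125]      # noise that moves the toy classifier to 0.5

no_noise = objective(toy_defense_classifier, s, [(1.0, zero)])
mixed = objective(toy_defense_classifier, s, [(0.5, zero), (0.5, n)])
always = objective(toy_defense_classifier, s, [(1.0, n)])
```

In this toy setting, adding the noise more often pushes the expected output closer to 0.5, which is the behavior the objective rewards.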
Achieving Goal II: The key challenge to achieve Goal II is how to quantify the utility loss of the confidence score vector. To address the challenge, we introduce two utility-loss metrics.
Label loss. Our first metric concentrates on the query data sample’s label predicted by the target classifier. Recall that the label of a query data sample is predicted as the one that has the largest confidence score. If the true confidence score vector and the noisy confidence score vector predict the same label for a query data sample, then the label loss is 0 for the query data sample; otherwise the label loss is 1 for the query data sample. The overall label loss of a defense mechanism is the label loss averaged over all query data samples. In some critical applications such as finance and healthcare, even 1% of label loss may be intolerable. In this work, we aim to achieve 0 label loss, i.e., our noise does not change the predicted label for any query data sample. Formally, we aim to achieve $\arg\max_j \{s_j\} = \arg\max_j \{s_j + n_j\}$, where $\arg\max_j \{s_j\}$ and $\arg\max_j \{s_j + n_j\}$ are the labels predicted based on the true and noisy confidence score vectors, respectively.
Confidence score distortion. The confidence score vector for a query data sample tells the user more information about the data sample’s label beyond the predicted label. Therefore, the added noise should not substantially distort the confidence score vector. First, the noisy confidence score vector should still be a probability distribution. Formally, we have $s_j + n_j \ge 0$ for $\forall j$ and $\sum_j (s_j + n_j) = 1$. Second, the distance $d(\mathbf{s}, \mathbf{s}+\mathbf{n})$ between the true confidence score vector and the noisy confidence score vector should be small. In particular, we consider that the model provider specifies a confidence score distortion budget called $\epsilon$, which indicates the upper bound of the expected confidence score distortion that the model provider can tolerate. Formally, we aim to achieve $E_{\mathcal{M}}(d(\mathbf{s}, \mathbf{s}+\mathbf{n})) \le \epsilon$. While any distance metric can be used to measure the distortion, we consider the $L_1$-norm of the noise vector as the distance metric, i.e., $d(\mathbf{s}, \mathbf{s}+\mathbf{n}) = \|\mathbf{n}\|_1$. We adopt the $L_1$-norm of the noise vector because it is easy to interpret. Specifically, the $L_1$-norm of the noise vector is simply the sum of the absolute values of its entries.
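The two utility-loss metrics can be computed for a single noise vector as in the sketch below. The helper names and the numeric tolerance are illustrative choices; note that the budget in the text bounds the expected distortion over the randomized mechanism, whereas this sketch only measures the distortion of one fixed noise vector.

```python
from typing import Sequence


def argmax(v: Sequence[float]) -> int:
    """Index of the largest entry (the predicted label)."""
    return max(range(len(v)), key=lambda j: v[j])


def label_loss(s: Sequence[float], s_noisy: Sequence[float]) -> int:
    """0 if the noise keeps the predicted label, 1 otherwise."""
    return 0 if argmax(s) == argmax(s_noisy) else 1


def is_probability_vector(v: Sequence[float], tol: float = 1e-9) -> bool:
    """Check non-negativity and that the entries sum to 1 (up to tol)."""
    return all(x >= -tol for x in v) and abs(sum(v) - 1.0) <= tol


def l1_distortion(s: Sequence[float], s_noisy: Sequence[float]) -> float:
    """Sum of absolute differences, i.e., the L1-norm of the noise."""
    return sum(abs(a - b) for a, b in zip(s, s_noisy))


s = [0.75, 0.125, 0.125]                          # true scores (toy values)
good_noise = [-0.25, 0.125, 0.125]                # keeps label 0
bad_noise = [-0.5, 0.5, 0.0]                      # flips the label to 1
s_good = [a + b for a, b in zip(s, good_noise)]   # [0.5, 0.25, 0.25]
s_bad = [a + b for a, b in zip(s, bad_noise)]     # [0.25, 0.625, 0.125]
```

Both noisy vectors are valid probability distributions, but only the first has zero label loss, so only it would be admissible under the constraints discussed here.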
Membership inference attack defense problem: After quantifying Goal I and Goal II, we can formally define our problem of defending against membership inference attacks.
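Definition 3.1 (Membership-Inference-Attack Defense Problem).
Given the decision function $g$ of the defense classifier, a confidence score distortion budget $\epsilon$, and a true confidence score vector $\mathbf{s}$, the defender aims to find a randomized noise addition mechanism $\mathcal{M}^*$ via solving the following optimization problem:
$$\mathcal{M}^* = \operatorname*{arg\,min}_{\mathcal{M}} \left| E_{\mathcal{M}}(g(\mathbf{s}+\mathbf{n})) - 0.5 \right| \tag{1}$$
subject to:
$$\arg\max_j \{s_j + n_j\} = \arg\max_j \{s_j\} \tag{2}$$
$$E_{\mathcal{M}}(d(\mathbf{s}, \mathbf{s}+\mathbf{n})) \le \epsilon \tag{3}$$
$$s_j + n_j \ge 0, \ \forall j \tag{4}$$
$$\sum_j (s_j + n_j) = 1, \tag{5}$$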
where the objective function of the optimization problem is to achieve Goal I and the constraints are to achieve Goal II. Specifically, the first constraint means that the added noise does not change the predicted label of the query data sample; the second constraint means that the confidence score distortion is bounded by the budget $\epsilon$; and the last two constraints mean that the noisy confidence score vector is still a probability distribution. Note that the last constraint is equivalent to $\sum_j n_j = 0$ since $\sum_j s_j = 1$. Moreover, we adopt the $L_1$-norm of the noise vector to measure the confidence score distortion, i.e., $d(\mathbf{s}, \mathbf{s}+\mathbf{n}) = \|\mathbf{n}\|_1$.
4. Our MemGuard
4.1. Overview
Finding the randomized noise addition mechanism is to solve the optimization problem in Equation 1. We consider two scenarios depending on whether $g(\mathbf{s})$ is 0.5 or not.
Scenario I: In this scenario, $g(\mathbf{s}) = 0.5$. For such a scenario, it is easy to solve the optimization problem in Equation 1. Specifically, the mechanism that adds the noise vector $\mathbf{0}$ with probability 1 is the optimal randomized noise addition mechanism, with which the objective function has a value of 0.
Scenario II: In this scenario, $g(\mathbf{s})$ is not 0.5. The major challenge in solving the optimization problem in this scenario is that the randomized noise addition mechanism is a probability distribution over the continuous noise space for a given true confidence score vector. The noise space consists of the noise vectors that satisfy the four constraints of the optimization problem. As a result, it is challenging to represent the probability distribution and solve the optimization problem. To address the challenge, we observe that the noise space can be divided into two groups depending on the output of the defense classifier’s decision function $g$. Specifically, for noise vectors in one group, if we add any of them to the true confidence score vector, then the decision function outputs 0.5 as the probability of being member. For noise vectors in the other group, if we add any of them to the true confidence score vector, then the decision function outputs a probability of being member that is not 0.5.
Based on this observation, we propose a two-phase framework to approximately solve the optimization problem. Specifically, in Phase I, for each noise group, we find the noise vector with minimum confidence score distortion (i.e., $d(\mathbf{s}, \mathbf{s}+\mathbf{n})$ is minimized) as a representative noise vector for the noise group. We choose the minimum-distortion vector so that the utility loss incurred whenever noise from that group is added is as small as possible. Since $g(\mathbf{s}) \neq 0.5$, the selected representative noise vector for the second noise group is $\mathbf{0}$. We denote by $\mathbf{r}$ the selected representative noise vector for the first noise group. In Phase II, we assume the randomized noise addition mechanism is a probability distribution over the two representative noise vectors instead of the overall noise space. Specifically, the defender adds the representative noise vector $\mathbf{r}$ to the true confidence score vector with a certain probability and does not add any noise with the remaining probability.
Next, we introduce our Phase I and Phase II.
4.2. Phase I: Finding $\mathbf{r}$
Finding $\mathbf{r}$ as solving an optimization problem: Our goal essentially is to find a noise vector $\mathbf{r}$ such that 1) the utility loss of the confidence score vector is minimized and 2) the decision function $g$ outputs 0.5 as the probability of being member when taking the noisy confidence score vector as an input. Formally, we find such a noise vector via solving the following optimization problem:
(6)  $\min_{\mathbf{r}} \; d(\mathbf{s}, \mathbf{s}+\mathbf{r})$
(7)  subject to:  $\operatorname{argmax}_j \{s_j + r_j\} = \operatorname{argmax}_j \{s_j\}$
(8)  $g(\mathbf{s}+\mathbf{r}) = 0.5$
(9)  $s_j + r_j \geq 0, \ \forall j$
(10)  $\sum_j r_j = 0$
where $\mathbf{s}$ is the true confidence score vector, the objective function means that the confidence score distortion is minimized, the first constraint means that the noise does not change the predicted label of the query data sample, the second constraint means that the defense classifier’s decision function outputs 0.5 (i.e., the defense classifier’s prediction is random guessing), and the last two constraints mean that the noisy confidence score vector is still a probability distribution.
Solving the optimization problem in Equation 6 can be viewed as finding an adversarial example to evade the defense classifier. In particular, $\mathbf{s}$ is a normal example and $\mathbf{s}+\mathbf{r}$ is an adversarial example. The adversarial machine learning community has developed many algorithms (e.g., (Carlini and Wagner, 2017; Papernot et al., 2016a; Goodfellow et al., 2015; Madry et al., 2018; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016; Tramèr et al., 2017; Moosavi-Dezfooli et al., 2017)) to find adversarial examples. However, these algorithms are insufficient for our problem because they do not consider the unique challenges of privacy protection. In particular, they do not consider the utility-loss constraints, i.e., the constraints in Equation 7, Equation 9, and Equation 10.
One naive method (we call it Random) to address the challenges is to generate a random noise vector that satisfies the utility-loss constraints. In particular, we can generate a random vector $\mathbf{r}'$ whose entries are non-negative and sum to 1. For instance, we first sample a number $r'_1$ from the interval $[0, 1]$ uniformly at random as the first entry. Then, we sample a number $r'_2$ from the interval $[0, 1 - r'_1]$ uniformly at random as the second entry. We repeat this process until the last entry, which is 1 minus the sum of the previous entries. Then, we exchange the largest entry of $\mathbf{r}'$ to the position of the predicted label to satisfy the constraint in Equation 7. Finally, we treat $\mathbf{r}' - \mathbf{s}$ as the noise vector, which is a solution to the optimization problem in Equation 6. However, as we will show in experiments, this Random method achieves suboptimal privacy-utility tradeoffs because the noise vector is not optimized and it is challenging to satisfy the constraint Equation 9. We propose to solve the optimization problem via change of variables and adding the constraints to the objective function.
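The Random baseline can be sketched in a few lines. This is an illustrative reading of the description above, assuming each entry is drawn uniformly from the probability mass that remains after the previous entries; the function name `random_noise` and the use of a NumPy `Generator` are assumptions of this sketch.

```python
import numpy as np

def random_noise(s, rng):
    """Sketch of the Random baseline: draw a random probability vector r',
    move its largest entry to the predicted label, and return r' - s."""
    s = np.asarray(s, dtype=float)
    k = s.size
    r_prime = np.empty(k)
    remaining = 1.0
    for j in range(k - 1):
        # Assumed sampling rule: uniform over the mass that is still unassigned.
        r_prime[j] = rng.uniform(0.0, remaining)
        remaining -= r_prime[j]
    r_prime[-1] = remaining  # last entry is 1 minus the sum of the others
    label = int(np.argmax(s))
    largest = int(np.argmax(r_prime))
    # Swap so that the largest random entry sits at the predicted label.
    r_prime[[label, largest]] = r_prime[[largest, label]]
    return r_prime - s

rng = np.random.default_rng(0)
s = np.array([0.1, 0.6, 0.3])
noise = random_noise(s, rng)
print(s + noise)  # a random probability vector with the same predicted label
```

The resulting vector preserves the label and remains a probability distribution, but nothing in the procedure steers the defense classifier's output toward 0.5.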
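Eliminating the constraints on probability distribution via change of variables: Since we consider the target classifier to be a neural network whose output layer is a softmax layer, the true confidence score vector $\mathbf{s}$ is a softmax function of some vector $\mathbf{z}$. The vector $\mathbf{z}$ is the output of the neurons in the second-to-last layer of the neural network and is often called the logits of the neural network. Formally, we have:
(11)  $\mathbf{s} = \text{softmax}(\mathbf{z})$
Moreover, we model the noisy confidence score vector as follows:
(12)  $\mathbf{s} + \mathbf{r} = \text{softmax}(\mathbf{z} + \mathbf{e})$
where $\mathbf{e}$ is a new vector variable. For any value of $\mathbf{e}$, the noisy confidence score vector is a probability distribution, i.e., the constraints in Equation 9 and Equation 10 are satisfied. Therefore, in the optimization problem in Equation 6, we replace the true confidence score vector $\mathbf{s}$ with $\text{softmax}(\mathbf{z})$ and the variable $\mathbf{r}$ with $\text{softmax}(\mathbf{z}+\mathbf{e}) - \text{softmax}(\mathbf{z})$. Then, we obtain the following optimization problem:
(13)  $\min_{\mathbf{e}} \; d(\text{softmax}(\mathbf{z}), \text{softmax}(\mathbf{z}+\mathbf{e}))$
(14)  subject to:  $\operatorname{argmax}_j \{z_j + e_j\} = \operatorname{argmax}_j \{z_j\}$
(15)  $g(\text{softmax}(\mathbf{z}+\mathbf{e})) = 0.5$
After solving for $\mathbf{e}$ in the above optimization problem, we can obtain the noise vector $\mathbf{r}$ as follows:
(16)  $\mathbf{r} = \text{softmax}(\mathbf{z}+\mathbf{e}) - \text{softmax}(\mathbf{z})$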
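The effect of the change of variables can be verified numerically: whatever shift is applied to the logits, the recovered noise vector sums to zero and the noisy vector stays non-negative. The sketch below assumes NumPy; `softmax` and `noise_from_logit_shift` are illustrative helper names, and the example values are arbitrary.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax of a 1-D vector."""
    v = np.asarray(v, dtype=float)
    w = np.exp(v - v.max())
    return w / w.sum()

def noise_from_logit_shift(z, e):
    """Noise vector r = softmax(z + e) - softmax(z), as in Equation 16."""
    z = np.asarray(z, dtype=float)
    e = np.asarray(e, dtype=float)
    return softmax(z + e) - softmax(z)

z = np.array([2.0, 0.5, -1.0])   # example logits (arbitrary values)
e = np.array([-0.3, 0.4, 1.0])   # example logit shift (arbitrary values)
r = noise_from_logit_shift(z, e)
print(r.sum(), softmax(z) + r)
```

Because both terms in the difference are probability distributions, the two probability-distribution constraints hold by construction and no longer need to be enforced explicitly.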
The optimization problem without the constraints on probability distribution is still challenging to solve because the remaining two constraints are highly nonlinear. To address the challenge, we turn the constraints into the objective function.
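Turning the constraint in Equation 15 into the objective function: We consider that the defender’s binary defense classifier is a neural network whose output layer has a single neuron with sigmoid activation function. Therefore, we have:
(17)  $g(\mathbf{s}+\mathbf{r}) = \frac{1}{1 + \exp(-h(\mathbf{s}+\mathbf{r}))}$
where $h(\mathbf{s}+\mathbf{r})$ is the output of the neuron in the second-to-last layer of the defense classifier when the defense classifier takes the noisy confidence score vector as an input. In other words, $h(\mathbf{s}+\mathbf{r})$ is the logit of the defense classifier. $g(\mathbf{s}+\mathbf{r}) = 0.5$ implies $h(\mathbf{s}+\mathbf{r}) = 0$. Therefore, we transform the constraint in Equation 15 to the following loss function:
(18)  $L_1 = |h(\text{softmax}(\mathbf{z}+\mathbf{e}))|$
where $L_1$ is small when $h(\text{softmax}(\mathbf{z}+\mathbf{e}))$ is close to 0.
Turning the constraint in Equation 14 into the objective function: We denote by $l$ the predicted label for the query data sample, i.e., $l = \operatorname{argmax}_j \{s_j\} = \operatorname{argmax}_j \{z_j\}$. The constraint in Equation 14 means that $z_l + e_l$ is the largest entry in the vector $\mathbf{z}+\mathbf{e}$. Therefore, we enforce the inequality constraint $z_l + e_l \geq \max_{j | j \neq l} \{z_j + e_j\}$. Moreover, we further transform the inequality constraint to the following loss function:
(19)  $L_2 = \text{ReLU}(-z_l - e_l + \max_{j | j \neq l} \{z_j + e_j\})$
where the function ReLU is defined as $\text{ReLU}(v) = \max\{0, v\}$. The loss function $L_2$ is 0 if the inequality $z_l + e_l \geq \max_{j | j \neq l} \{z_j + e_j\}$ holds.
Unconstrained optimization problem: After transforming the constraints into the objective function, we have the following unconstrained optimization problem:
(20)  $\min_{\mathbf{e}} \; L = L_1 + c_2 \cdot L_2 + c_3 \cdot L_3$
where $L_3 = d(\text{softmax}(\mathbf{z}), \text{softmax}(\mathbf{z}+\mathbf{e}))$, while $c_2$ and $c_3$ balance between the three terms.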
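To make the three terms concrete, the sketch below evaluates the unconstrained objective for a given logit shift. It assumes the $L_1$-norm as the distance metric and takes the defense classifier's logit as a caller-supplied function `h`; the linear `h` used in the example, the function name `memguard_loss`, and all numeric values are hypothetical stand-ins, not the trained classifier or settings of this work.

```python
import numpy as np

def softmax(v):
    v = np.asarray(v, dtype=float)
    w = np.exp(v - v.max())
    return w / w.sum()

def memguard_loss(e, z, h, c2, c3):
    """Return (L, (L1, L2, L3)) for a logit shift e.
    h maps a confidence score vector to the defense classifier's logit."""
    z = np.asarray(z, dtype=float)
    shifted = z + np.asarray(e, dtype=float)
    label = int(np.argmax(z))
    noisy = softmax(shifted)
    L1 = abs(h(noisy))                                # distance of the logit from 0
    others = np.delete(shifted, label)
    L2 = max(0.0, float(others.max() - shifted[label]))  # ReLU of the label margin
    L3 = float(np.abs(noisy - softmax(z)).sum())      # L1 confidence score distortion
    return L1 + c2 * L2 + c3 * L3, (L1, L2, L3)

# Hypothetical linear logit, used only to exercise the function.
w, b = np.array([4.0, -2.0, -2.0]), -1.0
h = lambda s: float(w @ s + b)

z = np.array([2.0, 0.5, -1.0])
total, terms = memguard_loss(np.zeros(3), z, h, c2=10.0, c3=0.1)
print(total, terms)
```

With a zero shift only the first term is non-zero; a shift that makes another logit the largest activates the second term, scaled by $c_2$.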
Solving the unconstrained optimization problem: We design an algorithm based on gradient descent to solve the unconstrained optimization problem. Algorithm 1 shows our algorithm. Since we aim to find a noise vector that has a small confidence score distortion, we iteratively search a large $c_3$. For each given $c_3$, we use gradient descent to find $\mathbf{e}$ that satisfies the constraints in Equation 14 and Equation 15. The process of searching $c_3$ stops when we cannot find a vector $\mathbf{e}$ that satisfies the two constraints. Specifically, given $c_2$, $c_3$, and a learning rate $\beta$, we iteratively update the vector variable $\mathbf{e}$ (i.e., the inner while loop in Algorithm 1). Since we transform the constraints in Equation 14 and Equation 15 into the objective function, there is no guarantee that they are satisfied during the iterative gradient descent process. Therefore, in each iteration of gradient descent, we check whether the two constraints are satisfied (i.e., Line 8 in Algorithm 1). Specifically, we continue the gradient descent process when the predicted label changes or the sign of the logit $h$ does not change. In other words, we stop the gradient descent process when both constraints are satisfied. We use $h(\text{softmax}(\mathbf{z})) \cdot h(\text{softmax}(\mathbf{z}+\mathbf{e})) \leq 0$ to approximate the constraint in Equation 15. In particular, the constraint in Equation 15 is equivalent to $h(\text{softmax}(\mathbf{z}+\mathbf{e})) = 0$. Once we find a vector $\mathbf{e}$ such that $h(\text{softmax}(\mathbf{z}))$ and $h(\text{softmax}(\mathbf{z}+\mathbf{e}))$ have different signs (e.g., $h(\text{softmax}(\mathbf{z})) > 0$ and $h(\text{softmax}(\mathbf{z}+\mathbf{e})) < 0$), $h(\text{softmax}(\mathbf{z}+\mathbf{e}))$ has just crossed 0 and should be close to 0 since we use a small learning rate. Note that we could also iteratively search $c_2$, but it is computationally inefficient to search both $c_2$ and $c_3$.
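The search described above can be sketched as follows. This is not Algorithm 1 itself, which is not reproduced in this section, but a reader's approximation under several assumptions: gradients are taken by central finite differences instead of backpropagation, the defense logit `h` is a hypothetical linear function, each round restarts from a zero shift, the weight on the distortion term grows tenfold per round, and the default values of `lr`, `max_iter`, and `max_rounds` are illustrative.

```python
import numpy as np

def softmax(v):
    w = np.exp(v - np.max(v))
    return w / w.sum()

def loss(e, z, h, c2, c3):
    """L = L1 + c2*L2 + c3*L3 with the L1-norm as the distortion metric."""
    label = int(np.argmax(z))
    shifted = z + e
    noisy = softmax(shifted)
    L1 = abs(h(noisy))
    L2 = max(0.0, float(np.delete(shifted, label).max() - shifted[label]))
    L3 = float(np.abs(noisy - softmax(z)).sum())
    return L1 + c2 * L2 + c3 * L3

def num_grad(f, e, eps=1e-6):
    """Central finite-difference gradient (stand-in for backpropagation)."""
    g = np.zeros_like(e)
    for j in range(e.size):
        step = np.zeros_like(e)
        step[j] = eps
        g[j] = (f(e + step) - f(e - step)) / (2 * eps)
    return g

def constraints_ok(e, z, h):
    """Label preserved and the defense logit has crossed (or reached) zero."""
    same_label = int(np.argmax(z + e)) == int(np.argmax(z))
    crossed = h(softmax(z)) * h(softmax(z + e)) <= 0
    return same_label and crossed

def find_logit_shift(z, h, c2=10.0, c3=0.1, lr=0.1, max_iter=300, max_rounds=8):
    """Return the shift found with the largest distortion weight that still
    satisfies both constraints, or None if no round succeeds."""
    z = np.asarray(z, dtype=float)
    best = None
    for _ in range(max_rounds):
        e = np.zeros_like(z)
        for _ in range(max_iter):
            if constraints_ok(e, z, h):
                break
            g = num_grad(lambda v: loss(v, z, h, c2, c3), e)
            norm = np.linalg.norm(g)
            if norm == 0:
                break
            e = e - lr * g / norm  # normalized gradient step
        if not constraints_ok(e, z, h):
            break                  # this weight is too large; keep the previous result
        best = e
        c3 *= 10.0
    return best

w, b = np.array([4.0, -2.0, -2.0]), -1.0   # hypothetical linear defense logit
h = lambda s: float(w @ s + b)
z = np.array([2.0, 0.5, -1.0])
e = find_logit_shift(z, h)
print(softmax(z), softmax(z + e), h(softmax(z + e)))
```

Because the step length is fixed by the learning rate, the logit at the returned point lies within one step of zero, which mirrors the sign-change approximation of the constraint in Equation 15.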
4.3. Phase II
After Phase I, we have two representative noise vectors. One is $\mathbf{0}$ and the other is $\mathbf{r}$. In Phase II, we assume the randomized noise addition mechanism is a probability distribution over the two representative noise vectors instead of the entire noise space. Specifically, we assume that the defender picks the representative noise vectors $\mathbf{r}$ and $\mathbf{0}$ with probabilities $p$ and $1-p$, respectively; and the defender adds the picked representative noise vector to the true confidence score vector. With this simplification, we can reduce the optimization problem in Equation 1 to the following optimization problem:
(21)  $p = \operatorname{argmin}_{p} \; |p \cdot g(\mathbf{s}+\mathbf{r}) + (1-p) \cdot g(\mathbf{s}+\mathbf{0}) - 0.5|$
(22)  subject to:  $p \cdot d(\mathbf{s}, \mathbf{s}+\mathbf{r}) + (1-p) \cdot d(\mathbf{s}, \mathbf{s}+\mathbf{0}) \leq \epsilon$
where the constraint means that the expected confidence score distortion is bounded by the budget. Note that we omit the other three constraints in Equation 2, Equation 4, and Equation 5. This is because both of our representative noise vectors already satisfy those constraints. Moreover, we can derive an analytical solution to the simplified optimization problem. The analytical solution is as follows:
(23)  $p = \begin{cases} 0, & \text{if } |g(\mathbf{s}) - 0.5| \leq |g(\mathbf{s}+\mathbf{r}) - 0.5| \\ \min\left(\frac{\epsilon}{d(\mathbf{s}, \mathbf{s}+\mathbf{r})}, 1.0\right), & \text{otherwise} \end{cases}$
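The analytical solution amounts to a two-branch rule. The sketch below is one way to write it, assuming the defender already has the defense classifier's outputs on the true and noisy vectors and the distortion of the representative noise vector; the function name `phase2_probability`, its argument names, and the guard for a zero distortion are illustrative additions.

```python
def phase2_probability(g_true, g_noisy, distortion, budget):
    """Probability p of adding the representative noise vector r.

    g_true     -- defense classifier output on the true vector, g(s)
    g_noisy    -- defense classifier output on the noisy vector, g(s + r)
    distortion -- d(s, s + r)
    budget     -- confidence score distortion budget (epsilon)
    """
    # Adding r does not move the prediction closer to 0.5: never add it.
    if abs(g_true - 0.5) <= abs(g_noisy - 0.5):
        return 0.0
    # Guard added for this sketch: a zero-distortion r costs nothing to add.
    if distortion == 0:
        return 1.0
    # Otherwise add r as often as the expected-distortion budget allows.
    return min(budget / distortion, 1.0)

print(phase2_probability(0.9, 0.5, distortion=0.4, budget=0.1))
```

When the budget is at least the distortion of $\mathbf{r}$, the noise is always added; a smaller budget scales the probability down proportionally so that the expected distortion equals the budget.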
One-time randomness: If the defender randomly samples one of the two representative noise vectors every time for the same query data sample, then an attacker could infer the true confidence score vector via querying the same data sample multiple times. We consider that the attacker knows our defense mechanism including the confidence score distortion metric $d$, the budget $\epsilon$, and that the noise vector is sampled from two representative noise vectors, one of which is $\mathbf{0}$.
Suppose the attacker queries the same data sample $n$ times from the target classifier. The attacker receives a confidence score vector $\mathbf{s}_1$ for $m$ times and a confidence score vector $\mathbf{s}_2$ for $n-m$ times. One confidence score vector is $\mathbf{s}+\mathbf{r}$ and the other is the true confidence score vector $\mathbf{s}$. Since the attacker receives two different confidence score vectors, the attacker knows $0 < p < 1$. Moreover, given the two confidence score vectors, the attacker can compute $p$ according to Equation 23 since the distance $d(\mathbf{s}, \mathbf{s}+\mathbf{r})$ does not depend on the ordering of $\mathbf{s}$ and $\mathbf{s}+\mathbf{r}$, i.e., $d(\mathbf{s}, \mathbf{s}+\mathbf{r}) = d(\mathbf{s}_1, \mathbf{s}_2)$. The attacker can also estimate the probabilities that the defender returns the confidence score vectors $\mathbf{s}_1$ and $\mathbf{s}_2$ as $\frac{m}{n}$ and $\frac{n-m}{n}$, respectively. If $\frac{m}{n}$ is closer to $p$, then the attacker predicts that $\mathbf{s}_2$ is the true confidence score vector, otherwise the attacker predicts $\mathbf{s}_1$ to be the true confidence score vector.
To address this challenge, we propose to use one-time randomness when the defender samples the representative noise, with which the defender always returns the same confidence score vector for the same query data sample. Specifically, for a query data sample, the defender quantizes each dimension of the query data sample and computes the hash value of the quantized data sample. Then, the defender generates a random number $p'$ in the range $[0, 1]$ via a pseudo random number generator with the hash value as the seed. If $p' < p$, the defender adds the representative noise vector $\mathbf{r}$ to the true confidence score vector, otherwise the defender does not add noise. The random number $p'$ is the same for the same query data sample, so the defender always returns the same confidence score vector for the same query data sample. We compute the hash value of the quantized query data sample as the seed such that the attacker cannot just slightly modify the query data sample to generate a different $p'$. The attacker can compute the random number $p'$ as we assume the attacker knows the defense mechanism including the hash function and pseudo random number generator. However, the attacker can no longer determine $p$ because the defender always returns the same confidence score vector for the same query data sample. Therefore, the attacker does not know whether the returned confidence score vector is the true one or not.
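One way to realize one-time randomness is sketched below. The text does not fix the quantization step, the hash function, or the pseudo random number generator, so rounding to two decimals, SHA-256, and Python's `random.Random` are assumptions of this sketch, as are the function names `one_time_random` and `release`.

```python
import hashlib
import random
import numpy as np

def one_time_random(x, decimals=2):
    """Deterministic number in [0, 1) derived from the quantized query sample."""
    # Quantize each dimension; adding 0.0 turns -0.0 into 0.0 so that both
    # hash to the same bytes.
    quantized = np.round(np.asarray(x, dtype=float), decimals) + 0.0
    digest = hashlib.sha256(quantized.tobytes()).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed).random()

def release(x, s, r, p):
    """Return s + r if the sample's one-time random number is below p, else s."""
    s = np.asarray(s, dtype=float)
    if one_time_random(x) < p:
        return s + np.asarray(r, dtype=float)
    return s

x = [0.123, 0.5]
# A slightly perturbed query quantizes to the same value and the same number.
print(one_time_random(x) == one_time_random([0.1201, 0.5004]))
```

Repeating a query, or nudging it within the quantization cell, therefore yields the same returned vector, so the frequency-counting strategy described above gives the attacker nothing to count.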
5. Evaluation
5.1. Experimental Setup
5.1.1. Datasets
We use three datasets that represent different application scenarios.
Location: This dataset was preprocessed from the Foursquare dataset (https://sites.google.com/site/yangdingqi/home/foursquaredataset) and we obtained it from (Shokri et al., 2017). The dataset has 5,010 data samples with 446 binary features, each of which represents whether a user visited a particular region or location type. The data samples are grouped into 30 clusters. This dataset represents a 30-class classification problem, where each cluster is a class.
Texas100: This dataset is based on the Discharge Data public use files published by the Texas Department of State Health Services (https://www.dshs.texas.gov/THCIC/Hospitals/Download.shtm). We obtained the preprocessed dataset from (Shokri et al., 2017). The dataset has data samples with binary features. These features represent the external causes of injury (e.g., suicide, drug misuse), the diagnosis, the procedures the patient underwent, and some generic information (e.g., gender, age, and race). Similar to (Shokri et al., 2017), we focus on the 100 most frequent procedures and the classification task is to predict a procedure for a patient using the patient’s data. This dataset represents a 100-class classification problem.
CH-MNIST: This dataset is used for classification of different tissue types on histology tiles from patients with colorectal cancer. The dataset contains images from 8 tissues. The classification task is to predict the tissue for an image, i.e., the dataset is an 8-class classification problem. The size of each image is . We obtained a preprocessed version from Kaggle (https://www.kaggle.com/kmader/colorectalhistologymnist).
Dataset splits: For each dataset, we will train a target classifier, an attack classifier, and a defense classifier. Therefore, we split each dataset into multiple folds. Specifically, for the Location (or CH-MNIST) dataset, we randomly sample 4 disjoint sets, each of which includes 1,000 data samples. We denote them as $D_1$, $D_2$, $D_3$, and $D_4$, respectively. For the Texas100 dataset, we also randomly sample such 4 disjoint sets, but each set includes 10,000 data samples as the Texas100 dataset is around one order of magnitude larger. Roughly speaking, for each dataset, we use $D_1$, $D_2$, and $D_3$ to learn the target classifier, the attack classifier, and the defense classifier, respectively; and we use $D_1 \cup D_3$ to evaluate the accuracy of the attack classifier. We describe in more detail how each set is used at the point where it is used.
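Table 2: Neural network architecture of the target classifier for CH-MNIST.
| Layer Type | Layer Parameters |
| --- | --- |
| Input | |
| Convolution | , strides=, padding=same |
| Activation | ReLU |
| Convolution | , strides= |
| Activation | ReLU |
| Pooling | MaxPooling |
| Convolution | , strides=, padding=same |
| Activation | ReLU |
| Convolution | , strides= |
| Activation | ReLU |
| Pooling | MaxPooling |
| Flatten | |
| Fully Connected | 512 |
| Fully Connected | 8 |
| Activation | softmax |
| Output | |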
5.1.2. Target Classifiers
For the Location and Texas100 datasets, we use a fully-connected neural network with four hidden layers as the target classifier. The numbers of neurons in the four layers are 1024, 512, 256, and 128, respectively. We use the popular activation function ReLU for the neurons in the hidden layers. The activation function in the output layer is softmax. We adopt the cross-entropy loss function and use Stochastic Gradient Descent (SGD) to learn the model parameters. We train epochs with a learning rate , and we decay the learning rate by in the th epoch for better convergence. For the CH-MNIST dataset, the neural network architecture of the target classifier is shown in Table 2. Similarly, we also adopt the cross-entropy loss function and use SGD to learn the model parameters. We train epochs with a learning rate and decay the learning rate by in the th epoch. For each dataset, we use $D_1$ to train the target classifier. Table 3 shows the training and testing accuracies of the target classifiers on the three datasets, where the testing accuracy is calculated by using the target classifier to make predictions for the data samples that are not in $D_1$.
5.1.3. Membership Inference Attacks
In a membership inference attack, an attacker trains an attack classifier, which predicts member or nonmember for a query data sample. The effectiveness of an attack is measured by the inference accuracy of the attack classifier, where the inference accuracy is the fraction of data samples in $D_1 \cup D_3$ that the attack classifier can correctly predict as member or nonmember. In particular, data samples in $D_1$ are members of the target classifier’s training dataset, while data samples in $D_3$ are nonmembers. We call the dataset $D_1 \cup D_3$ the evaluation dataset. We consider two categories of state-of-the-art black-box membership inference attacks, i.e., non-adaptive attacks and adaptive attacks. In non-adaptive attacks, the attacker does not adapt its attack classifier based on our defense, while the attacker adapts its attack classifier based on our defense in adaptive attacks.
Non-adaptive attacks: We consider the random guessing attack and state-of-the-art attacks as follows.
Random guessing (RG) attack. For any query data sample, this attack predicts it to be a member of the target classifier’s training dataset with probability 0.5. The inference accuracy of the RG attack is 0.5.
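Table 3: Training and testing accuracies of the target classifiers on the three datasets.
| | Location | Texas100 | CH-MNIST |
| --- | --- | --- | --- |
| Training Accuracy | 100.0% | 99.98% | 99.0% |
| Testing Accuracy | 60.32% | 51.59% | 72.0% |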
Neural Network (NN) attack (Shokri et al., 2017; Salem et al., 2019b). This attack assumes that the attacker knows the distribution of the target classifier’s training dataset and the architecture of the target classifier. We further split the dataset $D_2$ into two halves denoted as $D_2'$ and $D_2''$, respectively. The attacker uses $D_2'$ to train a shadow classifier that has the same neural network architecture as the target classifier. After training the shadow classifier, the attacker calculates the confidence score vectors for the data samples in $D_2'$ and $D_2''$, which are members and nonmembers of the shadow classifier. Then, the attacker ranks each confidence score vector and treats the ranked confidence score vectors of members and nonmembers as a “training dataset” to train an attack classifier. The attack classifier takes a data sample’s ranked confidence score vector as an input and predicts member or nonmember. For all three datasets, we consider the attack classifier to be a fully-connected neural network with three hidden layers, which have 512, 256, and 128 neurons, respectively. The output layer has just one neuron. The neurons in the hidden layers use the ReLU activation function, while the neuron in the output layer uses the sigmoid activation function. The attack classifier predicts member if and only if the neuron in the output layer outputs a value that is larger than 0.5. We train the attack classifier for 400 epochs with a learning rate 0.01 using SGD and decay the learning rate by 0.1 at the 300th epoch.
Random Forest (RF) attack. This attack is the same as the NN attack except that the RF attack uses a random forest as the attack classifier, while NN uses a neural network as the attack classifier. We use scikit-learn with the default setting to learn random forest classifiers. We consider this RF attack to demonstrate that our defense mechanism is still effective even if the attack classifier and the defense classifier (a neural network) use different types of algorithms, i.e., the noise vector that evades the defense classifier can also evade the attack classifier even if the two classifiers use different types of algorithms.
NSH attack (Nasr et al., 2018). Nasr, Shokri, and Houmansadr (Nasr et al., 2018) proposed this attack, which we abbreviate as NSH. This attack uses multiple neural networks. One network operates on the confidence score vector. Another one operates on the label, which is one-hot encoded. Both networks are fully-connected and have the same input dimension, i.e., the number of classes of the target classifier. Specifically, NSH assumes the attacker knows some members and nonmembers of the target classifier’s training dataset. In our experiments, we assume the attacker knows 30% of data samples in $D_1$ (i.e., members) and 30% of data samples in $D_3$ (i.e., nonmembers). The attacker uses these data samples to train the attack classifier. We adopt the neural network architecture in (Nasr et al., 2018) as the attack classifier. The remaining 70% of data samples in $D_1$ and $D_3$ are used to calculate the inference accuracy of the attack classifier. We train the attack classifier for epochs with an initial learning rate and decay the learning rate by after epochs.
Adaptive attacks: We consider two attacks that are customized to our defense.
Adversarial training (NN-AT). One adaptive attack is to train the attack classifier via adversarial training, which was considered to be the most empirically robust method against adversarial examples so far (Athalye et al., 2018). We adapt the NN attack using adversarial training and denote the adapted attack as NN-AT. Specifically, for each data sample in $D_2'$ and $D_2''$, the attacker calculates its confidence score vector using the shadow classifier. Then, the attacker uses Phase I of our defense to find the representative noise vector and adds it to the confidence score vector to obtain a noisy confidence score vector. Finally, the attacker trains the attack classifier via treating the true confidence score vectors and their corresponding noisy versions of data samples in $D_2'$ and $D_2''$ as a training dataset.
Rounding (NN-R). Since our defense adds carefully crafted small noise to the confidence score vector, an adaptive attack is to round each confidence score before using the attack classifier to predict member/nonmember. Specifically, we consider that the attacker rounds each confidence score to one decimal place and uses the NN attack. Note that rounding is also applied when training the NN attack classifier. We denote this attack NN-R.
Table 4 shows the inference accuracies of different attacks when our defense is not used. All attacks except RG have inference accuracies that are larger or substantially larger than 0.5.
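Table 4: Inference accuracies of different attacks when our defense is not used.
| | Location | Texas100 | CH-MNIST |
| --- | --- | --- | --- |
| RG | 50.0% | 50.0% | 50.0% |
| NN | 73.0% | 68.9% | 62.9% |
| RF | 73.7% | 67.3% | 58.7% |
| NSH | 81.1% | 74.0% | 58.4% |
| NN-AT | 64.6% | 68.3% | 63.3% |
| NN-R | 72.9% | 69.2% | 63.0% |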
5.1.4. Defense Setting
In our defense, we need to specify a defense classifier and the parameters in Algorithm 1.
Defense classifier: The defender itself trains a classifier to perform membership inference. We consider the defense classifier to be a neural network. However, since the defender does not know the attacker’s attack classifier, we assume the defense classifier and the attack classifier use different neural network architectures. Specifically, we consider three different defense classifiers in order to study the impact of the defense classifier on MemGuard. The three defense classifiers are fully-connected neural networks with 2, 3, and 4 hidden layers, respectively. The hidden layers of the three defense classifiers have (256, 128), (256, 128, 64), and (512, 256, 128, 64) neurons, respectively. The output layer has just one neuron. The activation function for the neurons in the hidden layers is , while the neuron in the output layer uses the sigmoid activation function. Unless otherwise mentioned, we use the defense classifier with 3 hidden layers. The defender calculates the confidence score vector for each data sample in $D_1$ and $D_3$ using the target classifier. The confidence score vectors for data samples in $D_1$ and $D_3$ have labels “member” and “nonmember”, respectively. The defender treats these confidence score vectors as a training dataset to learn a defense classifier, which takes a confidence score vector as an input and predicts member or nonmember. We train a defense classifier for epochs with a learning rate . We note that we can also synthesize data samples based on $D_1$ as nonmembers (Appendix A shows details).
Parameter setting: We set and in Algorithm 1. We found that once is larger than some threshold, MemGuard’s effectiveness does not change. Since we aim to find a representative noise vector that does not change the predicted label, we assign a relatively large value to $c_2$, which means that the objective function has a large value if the predicted label changes (i.e., the loss function $L_2$ is nonzero). In particular, we set . Our Algorithm 1 searches for a large $c_3$ and we set the initial value of $c_3$ to be 0.1. We also compare searching $c_2$ with searching $c_3$.
5.2. Experimental Results
MemGuard is effective: Figure 1 shows the inference accuracies of different attacks as the confidence score distortion budget increases on the three datasets. Since we adopt the expected $L_1$-norm of the noise vector to measure the confidence score distortion, the confidence score distortion is in the range [0, 2]. Note that our defense is guaranteed to achieve 0 label loss as our Algorithm 1 guarantees that the predicted label does not change when searching for the representative noise vector. We observe that MemGuard can effectively defend against membership inference attacks, i.e., the inference accuracies of all the evaluated attacks decrease as our defense is allowed to add larger noise to the confidence score vectors. For instance, on Location, when our defense is allowed to add noise whose expected $L_1$-norm is around 0.8, our defense can reduce all the evaluated attacks to the random guessing (RG) attack; on CH-MNIST, our defense can reduce the NSH attack (or the remaining attacks) to random guessing when allowed to add noise whose expected $L_1$-norm is around 0.3 (or 0.7).
Indistinguishability between the confidence score vectors of members and nonmembers: We follow previous work (Nasr et al., 2018) to study the distribution of confidence score vectors of members vs. nonmembers of the target classifier. Specifically, given a confidence score vector $\mathbf{s}$, we compute its normalized entropy as follows:
(24)  Normalized entropy: $-\frac{1}{\log k} \sum_j s_j \log(s_j)$
where $k$ is the number of classes in the target classifier. Figure 2 shows the distributions of the normalized entropy of the confidence score vectors for members (i.e., data samples in $D_1$) and nonmembers (i.e., data samples in $D_3$) of the target classifier, where we set the confidence score distortion budget $\epsilon$ to be 1 when our defense is used. The gap between the two curves in a graph corresponds to the information leakage of the target classifier’s training dataset. Our defense substantially reduces such gaps. Specifically, the maximum gap between the two curves (without defense vs. with defense) is (0.27 vs. 0.11), (0.41 vs. 0.05), and (0.30 vs. 0.06) on the Location, Texas100, and CH-MNIST datasets, respectively. Moreover, the average gap between the two curves (without defense vs. with defense) is (0.062 vs. 0.011), (0.041 vs. 0.005), and (0.030 vs. 0.006) on the three datasets, respectively.
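The normalized entropy used for these distributions is straightforward to compute. The sketch below assumes NumPy and treats zero-probability entries as contributing nothing to the sum (the usual convention that $0 \log 0 = 0$); the function name `normalized_entropy` is illustrative.

```python
import numpy as np

def normalized_entropy(s):
    """Entropy of a confidence score vector divided by log(k),
    so that the value lies in [0, 1] for k classes."""
    s = np.asarray(s, dtype=float)
    k = s.size
    nonzero = s[s > 0]                      # convention: 0 * log(0) = 0
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(k))

print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform vector
print(normalized_entropy([0.97, 0.01, 0.01, 0.01]))  # confident vector
```

A uniform vector attains the maximum value and a one-hot vector the minimum, so confidently predicted training members tend to sit near the low end of the range.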
Searching $c_3$ vs. searching $c_2$: Figure 3a shows the inference accuracy of the NN attack as the confidence score distortion budget increases when fixing $c_2$ to different values and searching $c_3$. Figure 3b shows the results when fixing $c_3$ and searching $c_2$. We observe that MemGuard is insensitive to the setting of $c_2$ when searching $c_3$. Specifically, MemGuard has almost the same effectiveness when fixing $c_2$ to different values, i.e., the different curves overlap in Figure 3a. This is because when our Phase I stops searching the noise vector, the predicted label is preserved, which means that the loss function $L_2$ is 0. However, MemGuard is sensitive to the setting of $c_3$ when searching $c_2$. Specifically, when fixing $c_3$ to be 0.1, searching $c_2$ achieves the same effectiveness as searching $c_3$. However, when fixing $c_3$ to be 1.0, searching $c_2$ is less effective. Therefore, we decided to search $c_3$ while fixing $c_2$.
Impact of defense classifiers: Figure 4 shows the inference accuracy of the NN attack as the confidence score distortion budget increases on the Location dataset when using different defense classifiers. We observe that MemGuard has similar effectiveness for different defense classifiers, which means that our carefully crafted noise vectors can transfer between classifiers.
MemGuard outperforms existing defenses: We compare with state-of-the-art defenses including Regularizer (Shokri et al., 2017), Min-Max Game (Nasr et al., 2018), Dropout (Salem et al., 2019b), Model Stacking (Salem et al., 2019b), and DP-SGD (Abadi et al., 2016). Each compared defense (except Model Stacking) has a hyperparameter to control the privacy-utility tradeoff: the hyperparameter that balances between the loss function and the regularizer in Regularizer, the hyperparameter that balances between the loss function and the adversarial regularizer in Min-Max Game, the dropout rate in Dropout, the privacy budget in DP-SGD, and the confidence score distortion budget $\epsilon$ in MemGuard. We also compare with MemGuard-Random, in which we use the Random method (refer to Section 4.2) to generate the noise vector in Phase I.
Before deploying any defense, we use the undefended target classifier to compute the confidence score vector for each data sample in the evaluation dataset. For each defense and a given hyperparameter, we apply the defense to the target classifier and use the defended target classifier to compute the confidence score vector for each data sample in the evaluation dataset. Then, we compute the confidence score distortion for each data sample and obtain the average confidence score distortion on the evaluation dataset. Moreover, we compute the inference accuracy of the attack classifier (we consider NN in these experiments) on the evaluation dataset after the defense is used. Therefore, for each defense and a given hyperparameter, we can obtain a pair (inference accuracy, average confidence score distortion). Via exploring different hyperparameters, we can obtain a set of such pairs for each defense. Then, we plot these pairs on a graph, which is shown in Figure 5.
Specifically, we tried the hyperparameter of Regularizer in the range with a step size , , and for Location, Texas100, and CH-MNIST datasets, respectively. We tried the hyperparameter of Min-Max Game in the range with a step size . We tried the dropout rate of Dropout in the range with a step size . We use a publicly available implementation (https://github.com/tensorflow/privacy) of DP-SGD. We tried the parameter that controls the privacy budget in the range with a step size . We tried as the in MemGuard and MemGuard-Random.
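Table 5: Results of Model Stacking.
| | Location | Texas100 | CH-MNIST |
| --- | --- | --- | --- |
| Inference Acc. | 50.0% | 50.8% | 50.0% |
| Average Distortion | 1.63 | 1.28 | 0.81 |
| Label Loss | 56.3% | 37.9% | 18.3% |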
Our results show that MemGuard achieves the best privacy-utility tradeoff. In particular, given the same average confidence score distortion, MemGuard achieves the smallest inference accuracy. According to the authors of Model Stacking, it does not have a hyperparameter to easily control the privacy-utility tradeoff. Therefore, we just obtain one pair of (inference accuracy, average confidence score distortion) and Table 5 shows the results. Model Stacking reduces the inference accuracy to be close to 0.5, but the utility loss is intolerable.
Similarly, we can obtain a set of pairs (inference accuracy, label loss) for the compared defenses and Figure 6 shows inference accuracy vs. label loss on the three datasets. Label loss is the fraction of data samples in the evaluation dataset whose predicted labels are changed by a defense. MemGuard-Random and MemGuard achieve 0 label loss. However, other defenses incur large label losses in order to substantially reduce the attacker’s inference accuracy.
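The two utility measures used in this comparison can be computed from the confidence score vectors before and after a defense is applied. The sketch below assumes both are given as NumPy arrays with one row per data sample in the evaluation dataset and uses the $L_1$-norm as the distortion; the function name `utility_loss` and the toy values are illustrative.

```python
import numpy as np

def utility_loss(original, defended):
    """Return (average confidence score distortion, label loss) over a dataset.
    Each argument has shape (num_samples, num_classes)."""
    original = np.asarray(original, dtype=float)
    defended = np.asarray(defended, dtype=float)
    # Per-sample L1 distance between the two confidence score vectors.
    distortion = np.abs(defended - original).sum(axis=1)
    # Fraction of samples whose predicted label is changed by the defense.
    changed = original.argmax(axis=1) != defended.argmax(axis=1)
    return float(distortion.mean()), float(changed.mean())

original = np.array([[0.7, 0.3], [0.6, 0.4]])
defended = np.array([[0.6, 0.4], [0.4, 0.6]])
print(utility_loss(original, defended))
```

Pairing each of these values with the attack's inference accuracy for a range of hyperparameter settings gives the points plotted in the tradeoff figures.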
6. Discussion and Limitations
On one hand, machine learning can be used by attackers to perform automated inference attacks. On the other hand, machine learning has various vulnerabilities, e.g., adversarial examples (Carlini and Wagner, 2017; Papernot et al., 2016a, 2017; Szegedy et al., 2013; Papernot et al., 2018, 2016c; Kurakin et al., 2016; Goodfellow et al., 2015). Therefore, attackers who rely on machine learning also share its vulnerabilities, and we can exploit such vulnerabilities to defend against them. For instance, we can leverage adversarial examples to mislead attackers who use machine learning classifiers to perform automated inference attacks (Jia and Gong, 2019). One key challenge in this research direction is how to extend existing adversarial example methods to address the unique challenges of privacy protection, for instance, how to achieve formal utility-loss guarantees.
In this work, we focus on membership inference attacks under the black-box setting, in which an attacker uses a binary classifier to predict a data sample to be a member or nonmember of a target classifier’s training dataset. In particular, the attacker’s classifier takes a data sample’s confidence score vector predicted by the target classifier as an input and predicts member or nonmember. Our defense adds carefully crafted noise to a confidence score vector to turn it into an adversarial example, such that the attacker’s classifier is likely to predict member or nonmember incorrectly. To address the challenges of achieving formal utility-loss guarantees, e.g., 0 label loss and bounded confidence score distortion, we design new methods to find adversarial examples.
Other than membership inference attacks, many other attacks rely on machine learning classifiers, e.g., attribute inference attacks (Chaabane et al., 2012; Gong and Liu, 2016; Jia et al., 2017), website fingerprinting attacks (Cai et al., 2012; Juarez et al., 2014; Wang et al., 2014; Panchenko et al., 2011; Herrmann et al., 2009), side-channel attacks (Zhang et al., 2012), location attacks (Backes et al., 2017; Oya et al., 2017; Pyrgelis et al., 2018; Zhang et al., 2018), and author identification attacks (Narayanan et al., 2012; Caliskan et al., 2018). For instance, online social network users are vulnerable to attribute inference attacks, in which an attacker leverages a machine learning classifier to infer users’ private attributes (e.g., gender, political view, and sexual orientation) using their public data (e.g., page likes) on social networks. The Facebook data privacy scandal in 2018 (https://bit.ly/2IDchsx) is a notable example of an attribute inference attack. In particular, Cambridge Analytica leveraged a machine learning classifier to automatically infer various private attributes of a large number of Facebook users from their public page likes. Jia and Gong proposed AttriGuard (Jia and Gong, 2018), which leverages adversarial examples to defend against attribute inference attacks. In particular, AttriGuard extends an existing adversarial example method to incorporate the unique challenges of privacy protection. The key difference between MemGuard and AttriGuard is that finding adversarial examples for confidence score vectors is subject to unique constraints, e.g., an adversarial confidence score vector should still be a probability distribution and the predicted label should not change. Such unique constraints require substantially different methods to find adversarial confidence score vectors.
Other studies have leveraged adversarial examples to defend against traffic analysis (Zhang et al., 2019) and author identification (Quiring et al., 2019; Meng et al., 2018). However, these studies did not consider formal utility-loss guarantees.
We believe it is valuable future work to extend MemGuard to defend against other machine learning based inference attacks such as website fingerprinting attacks, side-channel attacks, and membership inference attacks in the white-box setting. Again, a key challenge is how to achieve formal utility-loss guarantees with respect to certain reasonable utility-loss metrics.
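The two constraints on an adversarial confidence score vector mentioned above (it must remain a probability distribution, and the predicted label must not change) can be checked as follows. This is only a sketch of the constraint check, not of the method that finds the noise vector; the function name `is_valid_noisy_vector` and the tolerance values are assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def argmax(scores: Vector) -> int:
    """Index of the largest entry (first index wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def is_valid_noisy_vector(scores: Vector, noise: Vector, tol: float = 1e-9) -> bool:
    """Check that scores + noise is a probability distribution with the same label."""
    noisy = [s + r for s, r in zip(scores, noise)]
    # Constraint 1: entries are nonnegative and sum to 1.
    is_distribution = all(v >= -tol for v in noisy) and math.isclose(
        sum(noisy), 1.0, abs_tol=1e-6
    )
    # Constraint 2: the predicted label is unchanged, i.e., 0 label loss.
    same_label = argmax(noisy) == argmax(scores)
    return is_distribution and same_label


scores = [0.7, 0.2, 0.1]
print(is_valid_noisy_vector(scores, [-0.2, 0.15, 0.05]))  # noise sums to zero, label kept
print(is_valid_noisy_vector(scores, [-0.4, 0.4, 0.0]))    # label would change
```

Note that any noise vector satisfying the first constraint must sum to zero, since both the original and the noisy vectors sum to 1.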
MemGuard has a parameter ε, the confidence score distortion budget, which controls a tradeoff between membership privacy and confidence score vector distortion. The setting of ε may be dataset-dependent. One way to set ε is to leverage an inference accuracy vs. ε curve as shown in Figure 1. Specifically, given a dataset, we draw the inference accuracy vs. ε curves for various attack classifiers. Suppose we want the inference accuracy to be no larger than a threshold. Then, we select the smallest ε such that the inference accuracies of all the evaluated attack classifiers are no larger than the threshold.
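The selection rule just described can be sketched as a small search over measured curves. The data layout (a dictionary mapping each attack classifier's name to a mapping from budget value to measured inference accuracy), the function name `select_budget`, and all numbers below are hypothetical and only serve to illustrate the rule.

```python
from typing import Dict, Optional


def select_budget(
    curves: Dict[str, Dict[float, float]], threshold: float
) -> Optional[float]:
    """Smallest budget at which every attack's inference accuracy is <= threshold.

    curves maps an attack classifier name to {budget: inference accuracy}.
    Returns None if no evaluated budget satisfies the threshold for all attacks.
    """
    if not curves:
        return None
    # Only consider budgets that were evaluated for every attack classifier.
    common = set.intersection(*(set(curve) for curve in curves.values()))
    for budget in sorted(common):
        if all(curve[budget] <= threshold for curve in curves.values()):
            return budget
    return None


# Hypothetical curves for two attack classifiers (values are made up).
example_curves = {
    "NN": {0.0: 0.80, 0.5: 0.60, 1.0: 0.50},
    "RF": {0.0: 0.75, 0.5: 0.55, 1.0: 0.52},
}
print(select_budget(example_curves, threshold=0.60))
```

Because the rule takes the worst case over all evaluated attack classifiers, adding a stronger attack to the evaluation can only keep the selected budget the same or increase it.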
7. Conclusion and Future Work
In this work, we propose MemGuard to defend against black-box membership inference attacks. MemGuard is the first defense that has formal utility-loss guarantees on the confidence score vectors predicted by the target classifier. MemGuard works in two phases. In Phase I, MemGuard leverages a new algorithm to find a carefully crafted noise vector to turn a confidence score vector into an adversarial example. The new algorithm considers the unique utility-loss constraints on the noise vector. In Phase II, MemGuard adds the noise vector to the confidence score vector with a certain probability, for which we derive an analytical solution. Our empirical evaluation results show that MemGuard can effectively defend against black-box membership inference attacks and outperforms existing defenses. An interesting direction for future work is to extend MemGuard to defend against other types of machine learning based inference attacks such as white-box membership inference attacks, website fingerprinting attacks, and side-channel attacks.
Acknowledgements.
We thank the anonymous reviewers for insightful reviews. We would like to thank Hao Chen (University of California, Davis) for discussions. This work was partially supported by NSF grant No. 1937786.
References
Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 308–318.
Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers. CoRR abs/1306.4447.
Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In Proceedings of the 2018 International Conference on Machine Learning (ICML), pp. 274–283.
Membership Privacy in MicroRNA-based Studies. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 319–330.
walk2friends: Inferring Social Links from Mobility Profiles. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1943–1957.
Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds. In Proceedings of the 2014 Annual Symposium on Foundations of Computer Science (FOCS), pp. 464–473.
Touching from a Distance: Website Fingerprinting Attacks and Defenses. In Proceedings of the 2012 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 605–616.
When Coding Style Survives Compilation: Deanonymizing Programmers from Executable Binaries. In Proceedings of the 2018 Network and Distributed System Security Symposium (NDSS).
Mitigating Evasion Attacks to Deep Neural Networks via Region-based Classification. In Proceedings of the 2017 Annual Computer Security Applications Conference (ACSAC), pp. 278–287.
Towards Evaluating the Robustness of Neural Networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (S&P), pp. 39–57.
You Are What You Like! Information Leakage Through Users’ Interests. In Proceedings of the 2012 Network and Distributed System Security Symposium (NDSS).
Differentially Private Empirical Risk Minimization. Journal of Machine Learning Research.
Calibrating Noise to Sensitivity in Private Data Analysis. In Proceedings of the 2006 Theory of Cryptography Conference (TCC), pp. 265–284.
Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. In Proceedings of the 2015 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1322–1333.
Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing. In Proceedings of the 2014 USENIX Security Symposium (USENIX Security), pp. 17–32.
Property Inference Attacks on Fully Connected Neural Networks using Permutation Invariant Representations. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 619–633.
You are Who You Know and How You Behave: Attribute Inference Attacks via Users’ Social Friends and Behaviors. In Proceedings of the 2016 USENIX Security Symposium (USENIX Security), pp. 979–995.
Generative Adversarial Nets. In Proceedings of the 2014 Annual Conference on Neural Information Processing Systems (NIPS).
Explaining and Harnessing Adversarial Examples. In Proceedings of the 2015 International Conference on Learning Representations (ICLR).
MBeacon: Privacy-Preserving Beacons for DNA Methylation Data. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS).
LOGAN: Evaluating Privacy Leakage of Generative Models Using Generative Adversarial Networks. Symposium on Privacy Enhancing Technologies Symposium.
Website Fingerprinting: Attacking Popular Privacy Enhancing Technologies with the Multinomial Naive-Bayes Classifier. In Proceedings of the 2009 ACM Cloud Computing Security Workshop (CCSW), pp. 31–41.
Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLOS Genetics.
Towards Practical Differentially Private Convex Optimization. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (S&P).
Evaluating Differentially Private Machine Learning in Practice. In Proceedings of the 2014 USENIX Security Symposium (USENIX Security), pp. 1895–1912.
AttriGuard: A Practical Defense Against Attribute Inference Attacks via Adversarial Machine Learning. In Proceedings of the 2018 USENIX Security Symposium (USENIX Security).
Defending against Machine Learning based Inference Attacks via Adversarial Examples: Opportunities and Challenges. CoRR abs/1909.08526.
AttriInfer: Inferring User Attributes in Online Social Networks Using Markov Random Fields. In Proceedings of the 2017 International Conference on World Wide Web (WWW), pp. 1561–1569.
A Critical Evaluation of Website Fingerprinting Attacks. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 263–274.
Private Convex Optimization for Empirical Risk Minimization with Applications to High-dimensional Regression. In Proceedings of the 2012 Annual Conference on Learning Theory (COLT), pp. 1–25.
Adversarial Examples in the Physical World. CoRR abs/1607.02533.
Delving into Transferable Adversarial Examples and Black-box Attacks. CoRR abs/1611.02770.
Towards Measuring Membership Privacy. CoRR abs/1712.09136.
Understanding Membership Inferences on Well-Generalized Learning Models. CoRR abs/1802.04889.
Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the 2018 International Conference on Learning Representations (ICLR).
Exploiting Unintended Feature Leakage in Collaborative Learning. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (S&P).
MagNet: A Two-Pronged Defense against Adversarial Examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 135–147.
Adversarial Binaries for Authorship Identification. CoRR abs/1809.08316.
Universal Adversarial Perturbations. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1765–1773.
Deepfool: A Simple and Accurate Method to Fool Deep Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582.
On the Feasibility of Internet-Scale Author Identification. In Proceedings of the 2012 IEEE Symposium on Security and Privacy (S&P), pp. 300–314.
Machine Learning with Membership Privacy using Adversarial Regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS).
Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (S&P).
Towards Reverse-Engineering Black-Box Neural Networks. In Proceedings of the 2018 International Conference on Learning Representations (ICLR).
Back to the Drawing Board: Revisiting the Design of Optimal Location Privacy-preserving Mechanisms. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1943–1957.
Website Fingerprinting in Onion Routing Based Anonymization Networks. In Proceedings of the 2011 Workshop on Privacy in the Electronic Society (WPES), pp. 103–114.
Practical Black-Box Attacks Against Machine Learning. In Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security (ASIACCS), pp. 506–519.
The Limitations of Deep Learning in Adversarial Settings. In Proceedings of the 2016 IEEE European Symposium on Security and Privacy (Euro S&P), pp. 372–387.
Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (S&P), pp. 582–597.
Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples. CoRR abs/1605.07277.
SoK: Towards the Science of Security and Privacy in Machine Learning. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (Euro S&P).
Knock Knock, Who’s There? Membership Inference on Aggregate Location Data. In Proceedings of the 2018 Network and Distributed System Security Symposium (NDSS).
Under the Hood of Membership Inference Attacks on Aggregate Location Time-Series. CoRR abs/1902.07456.
Misleading Authorship Attribution of Source Code using Adversarial Learning. In Proceedings of the 2019 USENIX Security Symposium (USENIX Security), pp. 479–496.
Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning. CoRR abs/1904.01067.
ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS).
Privacy-Preserving Deep Learning. In Proceedings of the 2015 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 1310–1321.
Membership Inference Attacks Against Machine Learning Models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (S&P), pp. 3–18.
Privacy Risks of Securing Machine Learning Models against Adversarial Examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS).
Stochastic Gradient Descent with Differentially Private Updates. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 245–248.
Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research.
Intriguing Properties of Neural Networks. CoRR abs/1312.6199.
Ensemble Adversarial Training: Attacks and Defenses. In Proceedings of the 2017 International Conference on Learning Representations (ICLR).
Stealing Machine Learning Models via Prediction APIs. In Proceedings of the 2016 USENIX Security Symposium (USENIX Security), pp. 601–618.
Stealing Hyperparameters in Machine Learning. In Proceedings of the 2018 IEEE Symposium on Security and Privacy (S&P).
Differentially Private Empirical Risk Minimization Revisited: Faster and More General. In Proceedings of the 2017 Annual Conference on Neural Information Processing Systems (NIPS), pp. 2722–2731.
Effective Attacks and Provable Defenses for Website Fingerprinting. In Proceedings of the 2014 USENIX Security Symposium (USENIX Security), pp. 143–157.
Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. In Proceedings of the 2018 Network and Distributed System Security Symposium (NDSS).
Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. In Proceedings of the 2018 IEEE Computer Security Foundations Symposium (CSF).
Differentially Private Model Publishing for Deep Learning. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (S&P).
Statistical Privacy for Streaming Traffic. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS).
Tagvisor: A Privacy Advisor for Sharing Hashtags. In Proceedings of the 2018 Web Conference (WWW), pp. 287–296.
Cross-VM Side Channels and Their Use to Extract Private Keys. In Proceedings of the 2012 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 305–316.
Supplementary Material A Synthesizing Nonmembers
When training the defense classifier, we can use the training dataset as members and synthesize nonmembers based on it. For instance, for each data sample in the training dataset and each of its features, we keep the feature value with probability 0.9 and, with probability 0.1, replace it with a value randomly sampled from the corresponding data domain for the feature, which synthesizes a nonmember data sample. Then, we train the defense classifier using the training dataset as members and the synthesized data samples as nonmembers. Figure 7 shows the comparison results on the Location dataset (binary features), where MemGuard-S is the scenario where we synthesize the nonmembers for training the defense classifier. We observe that MemGuard and MemGuard-S achieve similar performance. Our results show that MemGuard does not necessarily need to split the training dataset in order to train the defense classifier.
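The synthesis step described above can be sketched as follows. The function name `synthesize_nonmember` and the representation of each feature's data domain as a list of allowed values are assumptions for illustration; the keep probability of 0.9 follows the text.

```python
import random
from typing import List, Optional, Sequence


def synthesize_nonmember(
    sample: Sequence[int],
    domains: Sequence[Sequence[int]],
    keep_prob: float = 0.9,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Perturb a member sample feature by feature to synthesize a nonmember.

    Each feature keeps its value with probability keep_prob; otherwise it is
    replaced by a value drawn uniformly from that feature's data domain
    (the draw may coincide with the original value).
    """
    if rng is None:
        rng = random.Random()
    return [
        value if rng.random() < keep_prob else rng.choice(list(domain))
        for value, domain in zip(sample, domains)
    ]


# Toy usage with binary features, as in the Location dataset.
member = [1, 0, 0, 1, 1, 0]
binary_domains = [[0, 1]] * len(member)
print(synthesize_nonmember(member, binary_domains, rng=random.Random(0)))
```

With binary features, a resampled value equals the original half of the time, so on average about 5% of the features of each synthesized sample actually differ from the member it was derived from.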