Bayesian optimization is a powerful tool for fine-tuning the hyper-parameters of a wide variety of machine learning models. The success of machine learning has led practitioners in diverse real-world settings to learn classifiers for practical problems. As machine learning becomes commonplace, Bayesian optimization becomes an attractive method for practitioners to automate the process of classifier hyper-parameter tuning. A key observation is that the data used for tuning models in these settings is often sensitive. Certain data such as genetic predisposition, personal email statistics, and car accident history, if not properly protected, may be at risk of being inferred from Bayesian optimization outputs. To address this, we introduce methods for releasing the best hyper-parameters and classifier accuracy privately. Leveraging the strong theoretical guarantees of differential privacy and known Bayesian optimization convergence bounds, we prove that under a GP assumption these private quantities are also near-optimal. Finally, even if this assumption is not satisfied, we can use different smoothness guarantees to protect privacy.
Differentially Private Bayesian Optimization
Matt J. Kusner firstname.lastname@example.org
Jacob R. Gardner email@example.com
Roman Garnett firstname.lastname@example.org
Kilian Q. Weinberger email@example.com
Computer Science & Engineering, Washington University in St. Louis
Machine learning is increasingly used in application areas with sensitive data. For example, hospitals use machine learning to predict if a patient is likely to be readmitted soon (Yu et al., 2013), webmail providers classify spam emails from non-spam (Weinberger et al., 2009), and insurance providers forecast the extent of bodily injury in car crashes (Chong et al., 2005).
In these scenarios data cannot be shared legally, but companies and hospitals may want to share hyper-parameters and validation accuracies through publications or other means. However, data-holders must be careful, as even a small amount of information can compromise privacy.
Which hyper-parameter setting yields the highest accuracy can reveal sensitive information about individuals in the validation or training data set, reminiscent of reconstruction attacks described by Dwork & Roth (2013) and Dinur & Nissim (2003). For example, imagine updated hyper-parameters are released right after a prominent public figure is admitted to a hospital. If a hyper-parameter is known to correlate strongly with a particular disease the patient is suspected to have, an attacker could make a direct correlation between the hyper-parameter value and the individual.
To prevent this sort of attack, we develop a set of algorithms that automatically fine-tune the hyper-parameters of a machine learning algorithm while provably preserving differential privacy (Dwork et al., 2006b). Our approach leverages recent results on Bayesian optimization (Snoek et al., 2012; Hutter et al., 2011; Bergstra & Bengio, 2012; Gardner et al., 2014), training a Gaussian process (GP) (Rasmussen & Williams, 2006) to accurately predict and maximize the validation gain of hyper-parameter settings. We show that the GP model in Bayesian optimization allows us to release noisy final hyper-parameter settings to protect against the aforementioned privacy attacks, while sacrificing only a tiny, bounded amount of validation gain.
Our privacy guarantees hold for releasing the best hyper-parameters and best validation gain. Specifically, our contributions are as follows:
We derive, to the best of our knowledge, the first framework for Bayesian optimization with provable differential privacy guarantees,
We develop variations both with and without observation noise, and
We show that even if our validation gain is not drawn from a Gaussian process, we can guarantee differential privacy under different smoothness assumptions.
We begin with the background on Bayesian optimization and differential privacy that we will use to prove our guarantees.
In general, our aim will be to protect the privacy of a validation dataset of sensitive records $\mathcal{V} \subseteq \mathcal{X}$ (where $\mathcal{X}$ is the collection of all possible records) when the results of Bayesian optimization depend on $\mathcal{V}$.
Our goal is to maximize an unknown function $f_{\mathcal{V}} : \Lambda \to \mathbb{R}$ that depends on some validation dataset $\mathcal{V} \subseteq \mathcal{X}$:
$$\max_{\lambda \in \Lambda} f_{\mathcal{V}}(\lambda). \tag{1}$$
It is important to point out that all of our results hold for the general setting of eq. (1), but throughout the paper, we use the vocabulary of a common application: that of machine learning hyper-parameter tuning. In this case $f_{\mathcal{V}}(\lambda)$ is the gain of a learning algorithm evaluated on validation dataset $\mathcal{V}$ that was trained with hyper-parameters $\lambda \in \Lambda$.
As evaluating $f_{\mathcal{V}}$ is expensive (e.g., each evaluation requires training a learning algorithm), Bayesian optimization gives a procedure for selecting a small number of locations at which to sample $f_{\mathcal{V}}$: $[\lambda_1, \ldots, \lambda_T]$. Specifically, given a current sample $\lambda_t$, we observe a validation gain $v_t$ such that $v_t = f_{\mathcal{V}}(\lambda_t) + \alpha_t$, where $\alpha_t \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with possibly non-zero variance $\sigma^2$. Then, given $v_t$ and previously observed values $v_1, \ldots, v_{t-1}$, Bayesian optimization updates its belief of $f_{\mathcal{V}}$ and samples a new hyper-parameter $\lambda_{t+1}$. Each step of the optimization proceeds in this way.
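To decide which hyper-parameter to sample next, Bayesian optimization places a prior distribution over $f_{\mathcal{V}}$ and updates it after every (possibly noisy) function observation. One popular prior distribution over functions is the Gaussian process $\mathcal{GP}(m(\cdot), k(\cdot,\cdot))$ (Rasmussen & Williams, 2006), parameterized by a mean function $m(\cdot)$ (we set $m = 0$, w.l.o.g.) and a kernel covariance function $k(\cdot,\cdot)$. Functions drawn from a Gaussian process have the property that any finite set of values of the function is jointly normally distributed. Additionally, given samples $\boldsymbol{\lambda}_T = [\lambda_1, \ldots, \lambda_T]$ and observations $\mathbf{v}_T = [v_1, \ldots, v_T]$, the GP posterior mean and variance have a closed form:
$$\mu_T(\lambda) = k(\lambda, \boldsymbol{\lambda}_T)(K_T + \sigma^2 I)^{-1}\mathbf{v}_T, \qquad k_T(\lambda, \lambda') = k(\lambda, \lambda') - k(\lambda, \boldsymbol{\lambda}_T)(K_T + \sigma^2 I)^{-1}k(\boldsymbol{\lambda}_T, \lambda'), \qquad \sigma_T^2(\lambda) = k_T(\lambda, \lambda), \tag{2}$$
where $k(\lambda, \boldsymbol{\lambda}_T)$ is evaluated element-wise on each of the $T$ columns of $\boldsymbol{\lambda}_T$. As well, $K_T = k(\boldsymbol{\lambda}_T, \boldsymbol{\lambda}_T)$ and $\lambda \in \Lambda$ is any hyper-parameter. As more samples are observed, the posterior mean function $\mu_T(\lambda)$ approaches $f_{\mathcal{V}}(\lambda)$.
A common rule for choosing the next sample is the upper confidence bound (UCB) criterion, which selects
$$\lambda_{t+1} \triangleq \arg\max_{\lambda \in \Lambda} \; \mu_t(\lambda) + \sqrt{\beta_{t+1}}\,\sigma_t(\lambda), \tag{3}$$
where $\beta_{t+1}$ is a parameter that trades off the exploitation of maximizing $\mu_t(\lambda)$ and the exploration of maximizing $\sigma_t(\lambda)$. Srinivas et al. (2010) proved that given certain assumptions on $f_{\mathcal{V}}$ and fixed, non-zero observation noise ($\sigma^2 > 0$), selecting hyper-parameters $\lambda$ to maximize eq. (3) is a no-regret Bayesian optimization procedure: $\lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} \left[ f_{\mathcal{V}}(\lambda^*) - f_{\mathcal{V}}(\lambda_t) \right] = 0$, where $\lambda^*$ is the maximizer of eq. (1). For the no-noise setting, de Freitas et al. (2012) give a UCB-based no-regret algorithm.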
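To make the posterior update and the UCB selection rule described above concrete, the following is a minimal NumPy sketch, assuming a zero-mean GP with a normalized squared exponential kernel. The function names (`sq_exp_kernel`, `gp_posterior`, `ucb_next`), the default noise variance, and the fixed value of `beta` are illustrative assumptions; in particular, the no-regret analysis uses a schedule for the exploration parameter rather than a constant.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared exponential kernel between the rows of A (n x d) and B (m x d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

def gp_posterior(lams, v, candidates, noise_var=0.01, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at each candidate point."""
    K = sq_exp_kernel(lams, lams, length_scale)
    k_star = sq_exp_kernel(candidates, lams, length_scale)  # shape (m, T)
    A = K + noise_var * np.eye(len(lams))
    mean = k_star @ np.linalg.solve(A, v)
    # The prior variance is 1 because the kernel is normalized.
    reduction = np.einsum("ij,ji->i", k_star, np.linalg.solve(A, k_star.T))
    return mean, np.maximum(1.0 - reduction, 0.0)

def ucb_next(lams, v, candidates, beta=4.0, **kwargs):
    """Index of the candidate maximizing mean + sqrt(beta) * std (UCB rule)."""
    mean, var = gp_posterior(lams, v, candidates, **kwargs)
    return int(np.argmax(mean + np.sqrt(beta) * np.sqrt(var)))

# Toy usage on a 1-D grid of candidate hyper-parameters (made-up data).
grid = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
sampled = np.array([[0.2], [0.8]])
gains = np.array([0.6, 0.4])
next_idx = ucb_next(sampled, gains, grid)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually preferable numerically for small problems like this one.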
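Alongside maximizing $f_{\mathcal{V}}$, we would like to guarantee that if $f_{\mathcal{V}}$ depends on (sensitive) validation data, we can release information about $f_{\mathcal{V}}$ so that the data $\mathcal{V}$ remains private. Specifically, we may wish to release (a) our best guess $\hat{\lambda}$ of the true (unknown) maximizer $\lambda^*$ and (b) our best guess $f_{\mathcal{V}}(\hat{\lambda})$ of the true (also unknown) maximum objective $f_{\mathcal{V}}(\lambda^*)$. The primary question this work aims to answer is: how can we release private versions $\tilde{\lambda}$ and $\tilde{v}$ of $\hat{\lambda}$ and $f_{\mathcal{V}}(\hat{\lambda})$ that are close to their true values, or better, the values $\lambda^*$ and $f_{\mathcal{V}}(\lambda^*)$? We give two answers to this question. The first will make a Gaussian process assumption on $f_{\mathcal{V}}$, which we describe immediately below. The second, described later when we drop the GP assumption, will utilize Lipschitz and convexity assumptions to guarantee privacy in the event the GP assumption does not hold.
For our first answer to this question, let us define a Gaussian process over hyper-parameters $\lambda, \lambda' \in \Lambda$ and datasets $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$ as follows: $\mathcal{GP}\left(0, k_1(\mathcal{V}, \mathcal{V}') \otimes k_2(\lambda, \lambda')\right)$. A prior of this form is known as a multi-task Gaussian process (Bonilla et al., 2008). Many choices for $k_1$ and $k_2$ are possible. The function $k_1(\mathcal{V}, \mathcal{V}')$ defines a set kernel (e.g., a function of the number of records that differ between $\mathcal{V}$ and $\mathcal{V}'$). For $k_2$, we focus on either the squared exponential: $k_2(\lambda, \lambda') = \exp\left(-\|\lambda - \lambda'\|_2^2 / (2\ell^2)\right)$ or Matérn kernels (e.g., for $\nu = 5/2$, $k_2(\lambda, \lambda') = \left(1 + \sqrt{5}\,r/\ell + 5r^2/(3\ell^2)\right)\exp\left(-\sqrt{5}\,r/\ell\right)$, for $r = \|\lambda - \lambda'\|_2$), for a fixed $\ell$, as they have known bounds on the maximum information gain (Srinivas et al., 2010). Note that as defined, the kernel $k_2$ is normalized (i.e., $k_2(\lambda, \lambda) = 1$).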
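As an illustration of the product (multi-task) kernel structure, here is a small sketch using only the Python standard library. The squared exponential and Matérn ($\nu = 5/2$) forms are the standard ones; the particular set kernel `set_kernel`, which decays exponentially in the number of differing records, is a hypothetical choice made only for this example, since the text requires only that it be some function of that count.

```python
import math

def sq_exp(lam, lam_p, length_scale=1.0):
    """Squared exponential kernel; equals 1 when lam == lam_p."""
    r = math.dist(lam, lam_p)
    return math.exp(-r * r / (2.0 * length_scale ** 2))

def matern52(lam, lam_p, length_scale=1.0):
    """Matern kernel with nu = 5/2; equals 1 when lam == lam_p."""
    s = math.sqrt(5.0) * math.dist(lam, lam_p) / length_scale
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

def set_kernel(V, V_p, scale=1.0):
    """Hypothetical set kernel: decays with the number of differing records."""
    return math.exp(-len(set(V) ^ set(V_p)) / scale)

def multitask_kernel(V, lam, V_p, lam_p, k2=matern52):
    """Product kernel over (dataset, hyper-parameter) pairs."""
    return set_kernel(V, V_p) * k2(lam, lam_p)

# Two neighboring datasets (record ids) that differ by swapping one record.
V, V_p = {1, 2, 3}, {1, 2, 4}
same_data = multitask_kernel(V, (0.1, 0.5), V, (0.2, 0.5))
neighbor_data = multitask_kernel(V, (0.1, 0.5), V_p, (0.2, 0.5))
```

Under this choice, swapping one record changes the symmetric difference by two elements, so the covariance between neighboring datasets is scaled by a fixed factor below one.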
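Assumption 1. We have a problem of type (1), where all possible dataset functions $f_{\mathcal{V}}$, $\mathcal{V} \subseteq \mathcal{X}$, are GP distributed, $\mathcal{GP}\left(0, k_1(\mathcal{V}, \mathcal{V}') \otimes k_2(\lambda, \lambda')\right)$, for known kernels $k_1, k_2$, for all $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$ and $\lambda, \lambda' \in \Lambda$, where $|\Lambda| < \infty$.
Similar Gaussian process assumptions have been made in previous work (Srinivas et al., 2010). For a result in the no-noise observation setting, we will make use of the assumptions of de Freitas et al. (2012) for our privacy guarantees, as described in the discussion of the no-noise setting below.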
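One of the most widely accepted frameworks for private data release is differential privacy (Dwork et al., 2006b), which has been shown to be robust to a variety of privacy attacks (Ganta et al., 2008; Sweeney, 1997; Narayanan & Shmatikov, 2008). Given an algorithm $\mathcal{A}$ that outputs a value $\lambda$ when run on dataset $\mathcal{V}$, the goal of differential privacy is to ‘hide’ the effect of a small change in $\mathcal{V}$ on the output of $\mathcal{A}$. Equivalently, an attacker should not be able to tell if a private record was swapped in $\mathcal{V}$ just by looking at the output of $\mathcal{A}$. If two datasets $\mathcal{V}, \mathcal{V}'$ differ by swapping a single element, we will refer to them as neighboring datasets. Note that any non-trivial algorithm (i.e., an algorithm $\mathcal{A}$ that outputs different values on $\mathcal{V}$ and $\mathcal{V}'$ for some pair $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$) must include some amount of randomness to guarantee such a change in $\mathcal{V}$ is unobservable in the output $\lambda$ of $\mathcal{A}$ (Dwork & Roth, 2013). The level of privacy we wish to guarantee decides the amount of randomness we need to add to $\lambda$ (better privacy requires increased randomness). Formally, the definition of differential privacy is stated below.
A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private for $\epsilon, \delta \geq 0$ if for all $\lambda \in \text{Range}(\mathcal{A})$ and for all neighboring datasets $\mathcal{V}, \mathcal{V}'$ (i.e., such that $\mathcal{V}$ and $\mathcal{V}'$ differ by swapping one record) we have that
$$\Pr\left[\mathcal{A}(\mathcal{V}) = \lambda\right] \leq e^{\epsilon}\,\Pr\left[\mathcal{A}(\mathcal{V}') = \lambda\right] + \delta. \tag{4}$$
The parameters $\epsilon, \delta$ guarantee how private $\mathcal{A}$ is; the smaller, the more private. The maximum privacy is $\epsilon = \delta = 0$, in which case eq. (4) holds with equality. This can be seen by the fact that $\mathcal{V}$ and $\mathcal{V}'$ can be swapped in the definition, and thus the inequality holds in both directions. If $\delta = 0$, we say the algorithm is simply $\epsilon$-differentially private. For a survey on differential privacy we refer the interested reader to Dwork & Roth (2013).
There are two popular methods for making an algorithm $\epsilon$-differentially private: (a) the Laplace mechanism (Dwork et al., 2006b), in which we add random noise to $\lambda$ and (b) the exponential mechanism (McSherry & Talwar, 2007), which draws a random output $\tilde{\lambda}$ such that $\tilde{\lambda} \approx \lambda$. For each mechanism we must define an intermediate quantity called the global sensitivity describing how much $\mathcal{A}$ changes when $\mathcal{V}$ changes.
(Laplace mechanism) The global sensitivity of an algorithm $\mathcal{A}$ over all neighboring datasets $\mathcal{V}, \mathcal{V}'$ (i.e., $\mathcal{V}, \mathcal{V}'$ differ by swapping one record) is
$$\Delta_{\mathcal{A}} \triangleq \max_{\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}} \left\| \mathcal{A}(\mathcal{V}) - \mathcal{A}(\mathcal{V}') \right\|_1.$$
(Exponential mechanism) The global sensitivity of a function $q : \mathcal{X} \times \Lambda \to \mathbb{R}$ over all neighboring datasets $\mathcal{V}, \mathcal{V}'$ is
$$\Delta_q \triangleq \max_{\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X},\; \lambda \in \Lambda} \left| q(\mathcal{V}, \lambda) - q(\mathcal{V}', \lambda) \right|.$$
The Laplace mechanism hides the output of $\mathcal{A}$ by perturbing its output with some amount of random noise.
Given a dataset $\mathcal{V}$ and an algorithm $\mathcal{A}$, the Laplace mechanism returns $\mathcal{A}(\mathcal{V}) + \omega$, where $\omega$ is a noise variable drawn from $\text{Lap}(\Delta_{\mathcal{A}} / \epsilon)$, the Laplace distribution with scale parameter $\Delta_{\mathcal{A}} / \epsilon$ (and location parameter $0$).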
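A minimal sketch of the Laplace mechanism for a scalar output, assuming the standard form in which the noise scale is the global sensitivity divided by the privacy parameter. The function name `laplace_mechanism` and the example values (a validation gain of 0.91 and a sensitivity of 0.02) are hypothetical and are not taken from the paper.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Return value + Laplace noise with scale sensitivity / epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release a validation gain (illustrative numbers only).
rng = np.random.default_rng(0)
private_gain = laplace_mechanism(0.91, sensitivity=0.02, epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` (stronger privacy) or larger sensitivities both increase the noise scale, which is the trade-off the surrounding text describes.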
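The exponential mechanism draws a slightly different $\tilde{\lambda}$ that is ‘close’ to $\lambda$, the output of $\mathcal{A}$.
Given a dataset $\mathcal{V}$ and an algorithm $\mathcal{A}(\mathcal{V}) = \arg\max_{\lambda \in \Lambda} q(\mathcal{V}, \lambda)$, the exponential mechanism returns $\tilde{\lambda}$, where $\tilde{\lambda}$ is drawn from the distribution $\frac{1}{Z}\exp\left(\frac{\epsilon\, q(\mathcal{V}, \lambda)}{2\Delta_q}\right)$, and $Z$ is a normalizing constant.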
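The exponential mechanism over a finite candidate set can be sketched as follows, assuming the standard sampling distribution proportional to $\exp(\epsilon\, q / (2\Delta_q))$. The function name `exponential_mechanism` and the example scores are hypothetical; in the setting of this paper the scores would be the values of the quality function at each candidate hyper-parameter.

```python
import numpy as np

def exponential_mechanism(scores, sensitivity, epsilon, rng=None):
    """Sample an index with probability proportional to
    exp(epsilon * score / (2 * sensitivity)); also return the probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()            # division by the normalizing constant Z
    return int(rng.choice(len(scores), p=probs)), probs

# Example: three candidate hyper-parameters with made-up quality scores.
idx, probs = exponential_mechanism(
    [0.1, 0.9, 0.5], sensitivity=1.0, epsilon=2.0, rng=np.random.default_rng(0)
)
```

Subtracting the maximum logit before exponentiating leaves the sampling distribution unchanged, because the shift cancels in the normalization.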
Given $\Lambda$, a possible set of hyper-parameters, we derive methods for privately releasing the best hyper-parameters $\tilde{\lambda}$ and the best function values $\tilde{v}$, approximately solving eq. (1). We first address the setting with observation noise in eq. (2) and then describe small modifications for the no-noise setting. For each setting we use the UCB sampling technique in eq. (3) to derive our private results.
In general Bayesian optimization, observation noise occurs in a variety of real-world modeling settings such as sensor measurement prediction (Krause et al., 2008). In hyper-parameter tuning, noise in the validation gain may result from noisy validation or training features.
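In the sections that follow, although the quantities all depend on the validation dataset $\mathcal{V}$, for notational simplicity we will occasionally omit the subscript $\mathcal{V}$. Similarly, for a neighboring dataset $\mathcal{V}'$ we will often write the corresponding primed quantity (e.g., $f'$ for $f_{\mathcal{V}'}$).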
In this section we guarantee that releasing in Algorithm 1 is private (Theorem 1) and that it is near-optimal (Theorem 2). Our proof strategy is as follows: we will first bound the global sensitivity of with probability at least . Then we will show that releasing via the exponential mechanism is -differentially private. Finally, we prove that is close to , the true maximizer of eq. (1).
As a first step we bound the global sensitivity of as follows:
Given Assumption 1, for any two neighboring datasets and for all with probability at least there is an upper bound on the global sensitivity (in the exponential mechanism sense) of :
for , .
Proof. Note that, by applying the triangle inequality twice, for all ,
We can now bound each one of the terms in the summation on the right hand side (RHS) with probability at least . According to Srinivas et al. (2010), Lemma 5.1, we obtain . The same can be applied to . As , because , we can upper bound both terms by . In order to bound the remaining (middle) term on the RHS recall that for a random variable we have: . For variables , we have, by the union bound, that . If we set and , we obtain , which completes the proof.
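The last step of the proof combines a Gaussian tail bound with a union bound. As a quick numerical sanity check of that style of argument, the sketch below assumes the standard bound $\Pr[|Z| > \gamma] \leq e^{-\gamma^2/2}$ for $Z \sim \mathcal{N}(0, 1)$ and the threshold $\gamma = \sqrt{2\log(n/\delta)}$, for which $n\,e^{-\gamma^2/2} = \delta$. The helper names and the example values of `n` and `delta` are hypothetical and do not reproduce the constants of the theorem.

```python
import math

def gaussian_two_sided_tail(gamma):
    """Exact Pr[|Z| > gamma] for a standard normal Z."""
    return math.erfc(gamma / math.sqrt(2.0))

def union_bound_threshold(n, delta):
    """gamma such that n * exp(-gamma^2 / 2) equals delta."""
    return math.sqrt(2.0 * math.log(n / delta))

# With n variables and failure probability delta (illustrative values):
n, delta = 100, 0.05
gamma = union_bound_threshold(n, delta)
exact_failure = n * gaussian_two_sided_tail(gamma)  # union bound, exact tails
bound_failure = n * math.exp(-gamma ** 2 / 2.0)     # union bound, tail bound
```

The exact failure probability is smaller than the bound, so a threshold chosen this way is conservative.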
We remark that all of the quantities in Theorem 1 are either given or selected by the modeler (e.g., ). Given this upper bound we can apply the exponential mechanism to release privately, as per Definition 1:
We leave the proof of Corollary 1 to the supplementary material. Even though we must release a noisy hyper-parameter setting , it is in fact near-optimal.
Proof. In general, the exponential mechanism selects that is close to the maximum (McSherry & Talwar, 2007):
with probability at least . Recall we assume that at each optimization step we observe noisy gain , where (with fixed noise variance ). As such, we can lower bound the term :
where the third line follows from Srinivas et al. (2010): Lemma 5.2 and the fourth line from the fact that .
As in the proof of Theorem 1, given a normal random variable we have: . Therefore if we set we have . This implies that (as defined in Algorithm 1) with probability at least . Thus, we can lower bound by . We can then lower bound in eq. (5) with the right hand side of eq. (6). Therefore, given the in Algorithm 1, Srinivas et al. (2010), Lemma 5.2 holds with probability at least and the theorem statement follows.
In this section we demonstrate releasing the validation gain in Algorithm 1 is private (Theorem 3) and that the noise we add to ensure privacy is bounded with high probability (Theorem 4). As in the previous section our approach will be to first derive the global sensitivity of the maximum found by Algorithm 1. Then we show releasing is -differentially private via the Laplace mechanism. Perhaps surprisingly, we also show that is close to .
We bound the global sensitivity of the maximum found with Bayesian optimization and UCB:
Proof. For notational simplicity let us denote the regret term as . Then from Theorem 1 in Srinivas et al. (2010) we have that
This implies with probability at least (with appropriate choice of ).
Recall that in the proof of Theorem 1 we showed that with probability at least (for given in Algorithm 1). This along with the above expression imply the following two sets of inequalities with probability greater than :
These, in turn, imply the two sets of inequalities:
This implies . That is, the global sensitivity of is bounded. Given the sensitivity of the maximum , we can readily derive the sensitivity of maximum . First note that we can use the triangle inequality to derive
We can immediately bound the final term on the right hand side. Note that as , the first two terms are bounded above by and , where (similarly for ). This is because, in the worst case, the observation noise shifts the observed maximum up or down by . Therefore, let if and otherwise, so that we have:
Although can be arbitrarily large, recall that for we have: . Therefore if we set we have . This implies that with probability at least . Therefore, if Theorem 1 from Srinivas et al. (2010) and the bound on hold together with probability at least as described above, the theorem follows directly.
As in Theorem 1 each quantity in the above bound is given in Algorithm 1 (, , ), given in previous results (Srinivas et al., 2010) (, ) or specified by the modeler (, ). Now that we have a bound on the sensitivity of the maximum we will use the Laplace mechanism to prove our privacy guarantee (proof in supplementary material):
Further, as the Laplace distribution has exponential tails, the noise we add to obtain is not too large:
Given the assumptions of Theorem 1, we have the following bound,
with probability at least for .
where the second and third inequality follow from the proof of Theorem 3 (using the regret bound of Srinivas et al. (2010): Theorem 1). Note that the third inequality holds with probability greater than (given in Algorithm 1). The final inequality implies . Also note that,
This implies that . Thus we have that . Finally, because could be arbitrarily large we give a high probability upper bound on for all . Recall that for we have by the tail probability bound and union bound that . Therefore, if we set and , we obtain . As defined .
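The remark that Laplace noise has exponential tails can be checked numerically. The sketch below uses the fact that a Laplace variable $\omega$ with scale $b$ satisfies $\Pr[|\omega| > t] = e^{-t/b}$, so $|\omega| \leq b \log(1/\delta)$ with probability $1 - \delta$. The helper name `laplace_noise_bound` and the values of `scale` and `delta` are illustrative assumptions and are not the constants appearing in the theorem.

```python
import math
import numpy as np

def laplace_noise_bound(scale, delta):
    """Magnitude that Laplace(0, scale) noise exceeds with probability delta."""
    return scale * math.log(1.0 / delta)

rng = np.random.default_rng(0)
scale, delta = 2.0, 0.05
bound = laplace_noise_bound(scale, delta)

# Empirical fraction of draws whose magnitude exceeds the bound.
samples = rng.laplace(0.0, scale, size=100_000)
empirical = float(np.mean(np.abs(samples) > bound))
```

The bound grows only logarithmically in $1/\delta$, which is why the privately released maximum stays close to the non-private one with high probability.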
We note that, because releasing either $\tilde{\lambda}$ or $\tilde{v}$ is $(\epsilon, \delta)$-differentially private, by Corollaries 1 and 2, releasing both private quantities in Algorithm 1 guarantees $(2\epsilon, 2\delta)$-differential privacy for validation dataset $\mathcal{V}$. This is due to the composition properties of $(\epsilon, \delta)$-differential privacy (Dwork et al., 2006a); in fact, stronger composition results can be demonstrated (Dwork & Roth, 2013).
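The composition argument can be illustrated with a small sketch that releases both quantities and tracks the combined privacy parameters under basic composition, where the parameters of the two releases add. This is a simplified illustration and not a reproduction of Algorithm 1: the function name `release_both` is hypothetical, and the sensitivities `sens_mean` and `sens_max` are placeholders supplied by the caller rather than the bounds derived in the theorems.

```python
import numpy as np

def release_both(post_mean, gains, sens_mean, sens_max, epsilon, delta, rng=None):
    """Release a hyper-parameter index (exponential mechanism) and a noisy
    maximum gain (Laplace mechanism); return the composed privacy parameters."""
    rng = np.random.default_rng() if rng is None else rng
    post_mean = np.asarray(post_mean, dtype=float)

    # Exponential mechanism over the posterior mean at each candidate.
    logits = epsilon * post_mean / (2.0 * sens_mean)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = int(rng.choice(len(post_mean), p=probs))

    # Laplace mechanism on the best observed validation gain.
    noisy_max = float(np.max(gains) + rng.laplace(0.0, sens_max / epsilon))

    # Basic composition: the two (epsilon, delta) guarantees add up.
    return idx, noisy_max, (2.0 * epsilon, 2.0 * delta)

idx, noisy_max, budget = release_both(
    post_mean=[0.2, 0.7], gains=[0.3, 0.6], sens_mean=0.5, sens_max=0.5,
    epsilon=1.0, delta=0.01, rng=np.random.default_rng(0),
)
```

If a fixed total budget is required, one option is to run each release with half of the target parameters so that the composed guarantee matches the target.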
In hyper-parameter tuning it may be reasonable to assume that we can observe function evaluations exactly: . First note that we can use the same algorithm to report the maximum in the no-noise setting. Theorems 1 and 2 still hold (note that in Theorem 2). However, we cannot readily report a private maximum as the information gain in Theorems 3 and 4 approaches infinity as . Therefore, we extend results from the previous section to the exact observation case via the regret bounds of de Freitas et al. (2012). Algorithm 2 demonstrates how to privatize the maximum in the exact observation case.
We demonstrate that releasing in Algorithm 2 is private (Theorem 3) and that a small amount of noise is added to make private (Theorem 6). To do so, we derive the global sensitivity of in Algorithm 2 independent of the maximum information gain via de Freitas et al. (2012). Then we prove releasing is -differentially private and that is almost .
The following Theorem gives a bound on the global sensitivity of the maximum .
We leave the proof to the supplementary material.
Given this sensitivity, we may apply the Laplace mechanism to release .
Even though we must add noise to the maximum we show that is still close to the optimal .
We prove Corollary 3 and Theorem 6 in the supplementary material. We have demonstrated that in the noisy and noise-free settings we can release private near-optimal hyper-parameter settings $\tilde{\lambda}$ and function evaluations $\tilde{v}$. However, the analysis thus far assumes the hyper-parameter set is finite: $|\Lambda| < \infty$. It is possible to relax this assumption, using an analysis similar to that of Srinivas et al. (2010). We leave this analysis to the supplementary material.
Even if our true validation score is not drawn from a Gaussian process (Assumption 1), we can still guarantee differential privacy for releasing its value after Bayesian optimization . In this section we describe a different functional assumption on that also yields differentially private Bayesian optimization for the case of machine learning hyper-parameter tuning.
Assume we have a (nonsensitive) training set , which, given a hyperparameter produces a model from the following optimization,
The function is a training loss function (e.g., logistic loss, hinge loss). Given a (sensitive) validation set we would like to use Bayesian optimization to maximize a validation score .
Algorithm 3 describes a procedure for privately releasing the best validation accuracy given Assumption 2. Unlike the previous algorithms, Algorithm 3 may run Bayesian optimization with any acquisition function (e.g., expected improvement (Mockus et al., 1978), UCB) and privacy is still guaranteed.
Similar to Algorithms 1 and 2 we use the Laplace mechanism to mask the possible change in validation accuracy when is swapped with a neighboring validation set . Different from the work of Chaudhuri & Vinterbo (2013), changing to may also lead to Bayesian optimization searching different hyper-parameters, vs. . Therefore, we must bound the total global sensitivity of with respect to and ,
The total global sensitivity of over all neighboring datasets is
In the following theorem we demonstrate that we can bound the change in for arbitrary .
Given Assumption 2, for neighboring and arbitrary we have that,
where is the Lipschitz constant of , , and is the size of .
Proof. Applying the triangle inequality yields
This second term is bounded by Chaudhuri & Vinterbo (2013) in the proof of Theorem 4. The only difference is that, as we are not adding random noise to , we have that .
To bound the first term, let be the value of the objective in eq. (10) for a particular . Note that and are and -strongly convex. Define
Further, define the minimizers and . This implies that
Given that is -strongly convex (Shalev-Shwartz, 2007), and by the Cauchy-Schwarz inequality,
Now as is the minimizer of