# Differentially Private Objective Perturbation: Beyond Smoothness and Convexity

Seth Neel (Wharton Statistics Department, University of Pennsylvania), Aaron Roth (Department of Computer and Information Sciences, University of Pennsylvania), Giuseppe Vietri (Computer Science and Engineering Department, University of Minnesota), Zhiwei Steven Wu (Computer Science and Engineering Department, University of Minnesota)

Aaron Roth: This material is based upon work supported by the United States Air Force and DARPA under Contract No FA8750-16-C-0022. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and DARPA.

Zhiwei Steven Wu: Supported in part by a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Mozilla research grant, and a Facebook Research Award.

###### Abstract

One of the most effective algorithms for differentially private learning and optimization is objective perturbation. This technique augments a given optimization problem (e.g. deriving from an ERM problem) with a random linear term, and then exactly solves it. However, to date, analyses of this approach crucially rely on the convexity and smoothness of the objective function. We give two algorithms that extend this approach substantially. The first algorithm requires nothing except boundedness of the loss function, and operates over a discrete domain. Its privacy and accuracy guarantees hold even without assuming convexity. The second algorithm operates over a continuous domain and requires only that the loss function be bounded and Lipschitz in its continuous parameter. Its privacy analysis does not even require convexity. Its accuracy analysis does require convexity, but does not require second order conditions like smoothness. We complement our theoretical results with an empirical evaluation of the non-convex case, in which we use an integer program solver as our optimization oracle. We find that for the problem of learning linear classifiers, directly optimizing for 0/1 loss using our approach can out-perform the more standard approach of privately optimizing a convex-surrogate loss function on the Adult dataset.

## 1 Introduction

Consider the general problem of optimizing a function $L(\mathcal{D}, w)$ defined with respect to a dataset $\mathcal{D}$ and a parameter $w \in \mathcal{W}$: $\min_{w \in \mathcal{W}} L(\mathcal{D}, w)$. This general class of problems includes classical empirical risk minimization, amongst others, and is a basic problem in learning and optimization. We say that such a function is $1$-sensitive in the dataset $\mathcal{D}$ if changing one datapoint in $\mathcal{D}$ can change the value of $L(\mathcal{D}, w)$ by at most 1, for any parameter value $w$. Suppose that we want to solve an optimization problem like this subject to the constraint of differential privacy. The exponential mechanism provides a powerful, general-purpose, and often error-optimal method to solve this problem [MT07]. It requires no assumptions on the function other than that it is $1$-sensitive (this is a minimal assumption for privacy: more generally, its guarantees are parameterized by the sensitivity of the function). Unfortunately, the exponential mechanism is generally infeasible to run: its implementation (and the implementation of related mechanisms, like “Report-Noisy-Max” [DR14]) requires the ability to enumerate the parameter range $\mathcal{W}$, making it infeasible in most learning settings, despite its use in proving general information theoretic bounds in private PAC learning [KLNRS08]. When $L(\mathcal{D}, \cdot)$ is continuous, convex, and satisfies second order conditions like strong convexity or smoothness, the situation is better: there are a number of algorithms available, including simple output perturbation [objective1] and objective perturbation [objective1, objective2, objective3]. This partly mirrors the situation in non-private data analysis, in which convex optimization problems can be solved quickly and efficiently, and most non-convex problems are NP-hard in the worst case.

In the non-private case, however, the worst-case complexity of optimization problems does not tell the whole story. For many non-convex optimization problems, such as integer programming, there are fast heuristics that not only reliably succeed in optimizing functions deriving from real inputs, but can also certify their own success. In such settings, can we leverage these heuristics to obtain practical private optimization algorithms? In this paper, we give two novel analyses of objective perturbation algorithms that extend their applicability to 1-sensitive non-convex problems (and more generally, bounded sensitivity functions). We also get new results for convex problems, without the need for second order conditions like smoothness or strong convexity. Our first algorithm operates over a discrete parameter space $\mathcal{W}$, and requires no further assumptions beyond 1-sensitivity for either its privacy or accuracy analysis, i.e. it is comparable in generality to the exponential mechanism. The second algorithm operates over a continuous parameter space $\mathcal{W}$, and requires only that $L$ be Lipschitz-continuous in its second argument. Its privacy analysis does not require convexity. Its accuracy analysis does, but it does not require any second order conditions. We implement our first algorithm to directly optimize classification error over a discrete set of linear functions on the Adult dataset, and find that it substantially outperforms private logistic regression.

### 1.1 Related work

Objective perturbation was first introduced by [objective1], and analyzed for the special case of strongly convex functions. Its analysis was subsequently improved and generalized [objective2, objective3] to apply to smooth convex functions, and to tolerate a small degree of error in the optimization procedure. Our paper is the first to give an analysis of objective perturbation without the assumption of convexity, and the first to give an accuracy analysis without making second order assumptions on the objective function even in the convex case. [objective1] also introduced the related technique of output perturbation which perturbs the exact optimizer of a strongly convex function.

The work most closely related to our first algorithm is [neel2018use], who also give a similar “oracle efficient” algorithm for non-convex differentially private optimization: i.e. reductions from non-private optimization to private optimization. Their algorithm (“Report Separator Perturbed Noisy Max”, or RSPM) relies on an implicit perturbation of the optimization objective by augmenting the dataset with a random collection of examples drawn from a separator set. The algorithms which we introduce in this paper are substantially more general: because they directly perturb the objective, they do not rely on the existence of a small separator set for the class of functions in question. One of the contributions of our paper is the first experimental analysis of RSPM, in Section 5. [neel2018use] also give a generic method to transform an algorithm (like ours) whose privacy analysis depends on the success of the optimization oracle, to an algorithm whose privacy analysis does not depend on this, whenever the optimization heuristic can certify its success (integer program solvers have this property). Their method applies to the algorithms we develop in this paper. Our second algorithm crucially uses a stability result recently proven by [suggala2019online] in the context of online learning.

## 2 Preliminaries

We first define a dataset, a loss function with respect to a dataset, and the two types of optimization oracles we will call upon. We then define differential privacy, and state basic properties.

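A dataset $\mathcal{D}$ is defined as a (multi)set of $G$-Lipschitz loss functions $l$. For $w$ in a parameter space $\mathcal{W} \subseteq \mathbb{R}^d$, the loss on dataset $\mathcal{D}$ is defined to be

$$L(\mathcal{D}, w) = \sum_{l \in \mathcal{D}} l(w)$$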

We will define two types of perturbed loss functions, and the corresponding oracles which are assumed to be able to optimize each type. These will be used in our discrete objective perturbation algorithm in Section 3 and our sampling based objective perturbation algorithm in Section 4 respectively.

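Given a vector $\sigma \in \mathbb{R}^d$, we define the perturbed loss to be:

$$\bar{L}(\mathcal{D}, w, \sigma) = \frac{L(\mathcal{D}, w) - \langle \sigma, w \rangle}{n}$$

where $n = |\mathcal{D}|$ is the number of loss functions in the dataset. This is simply the loss function augmented with a linear term.

Let $\pi$ be the projection formally defined in Section 3, which informally maps a $d$-dimensional vector with norm at most $D$ to a unit vector in $\mathbb{R}^{d+1}$. Given a vector $\eta \in \mathbb{R}^{d+1}$, we define the perturbed projected loss to be:

$$\bar{L}^{\pi}(\mathcal{D}, w, \eta) = \frac{L(\mathcal{D}, w) - \langle \eta, \pi(w) \rangle}{n}$$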
###### Definition 2.1 (Approximate Linear Optimization Oracle).

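Given as input a dataset $\mathcal{D}$ and a $d$-dimensional vector $\sigma$, an $\alpha$-approximate linear optimization oracle $O_\alpha$ returns $w^* = O_\alpha(\mathcal{D}, \sigma) \in \mathcal{W}$ such that

$$\bar{L}(\mathcal{D}, w^*, \sigma) \le \inf_{w \in \mathcal{W}} \bar{L}(\mathcal{D}, w, \sigma) + \alpha$$

When $\alpha = 0$ we say $O$ is a linear optimization oracle.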

###### Definition 2.2 (Approximate Projected Linear Optimization Oracle).

Given as input a dataset $\mathcal{D}$ and a $(d+1)$-dimensional vector $\sigma$, an $\alpha$-approximate projected linear optimization oracle returns $w^* \in \mathcal{W}$ such that

$$\bar{L}^{\pi}(\mathcal{D}, w^*, \sigma) \le \inf_{w \in \mathcal{W}} \bar{L}^{\pi}(\mathcal{D}, w, \sigma) + \alpha$$

When $\alpha = 0$ we say the oracle is a projected linear optimization oracle. We remark that while it seems less natural to assume an oracle for the projected perturbed loss, which involves the non-linearity $\pi$, in Section D.2 we show how we can linearize this term by introducing an auxiliary variable and a convex constraint. This is ultimately how we implement this oracle in our experiments.

###### Definition 2.3.

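A randomized algorithm $M$ is an $(\alpha, \beta)$-minimizer for $\mathcal{W}$ if for every dataset $\mathcal{D}$, with probability $1 - \beta$, it outputs $M(\mathcal{D}) = w$ such that:

$$\frac{1}{n} L(\mathcal{D}, w) \le \inf_{w^* \in \mathcal{W}} \frac{1}{n} L(\mathcal{D}, w^*) + \alpha$$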

Certain optimization routines will have guarantees only for discrete parameter spaces:

###### Definition 2.4 (Discrete parameter spaces).

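A $\tau$-separated discrete parameter space $\mathcal{W}_\tau \subseteq \mathbb{R}^d$ is a discrete set such that for any pair of distinct vectors $w_1, w_2 \in \mathcal{W}_\tau$ we have $\|w_1 - w_2\|_2 \ge \tau$.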
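
As a concrete illustration of this definition, the following minimal sketch builds a grid with spacing $\tau$ inside the unit Euclidean ball and checks the separation numerically. The helper names `grid_in_unit_ball` and `min_pairwise_distance`, and the choice of a regular grid, are illustrative assumptions and not part of the paper's algorithms.

```python
import itertools
import math


def grid_in_unit_ball(d, tau):
    """All points of the grid {k * tau}^d that lie inside the unit Euclidean ball."""
    k_max = int(math.floor(1.0 / tau))
    ticks = [k * tau for k in range(-k_max, k_max + 1)]
    return [
        w
        for w in itertools.product(ticks, repeat=d)
        if math.sqrt(sum(x * x for x in w)) <= 1.0 + 1e-12
    ]


def min_pairwise_distance(points):
    """Smallest Euclidean distance between two distinct points."""
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))


# Toy instance: dimension 2, spacing 0.5.
W_tau = grid_in_unit_ball(d=2, tau=0.5)
separation = min_pairwise_distance(W_tau)
```

Any two distinct grid points differ by at least the spacing in some coordinate, so the measured separation should never fall below $\tau$.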

Finally we define differential privacy.

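We call two data sets $\mathcal{D}, \mathcal{D}'$ neighbors (written as $\mathcal{D} \sim \mathcal{D}'$) if $\mathcal{D}'$ can be derived from $\mathcal{D}$ by replacing a single loss function $l_i \in \mathcal{D}$ with some other loss function $l_i'$.

###### Definition 2.5 (Differential Privacy [Dmns06, Dkmmn06]).

Fix $\epsilon, \delta \ge 0$. A randomized algorithm $A$ is $(\epsilon, \delta)$-differentially private (DP) if for every pair of neighboring data sets $\mathcal{D} \sim \mathcal{D}'$, and for every event $\Omega$:

$$\Pr[A(\mathcal{D}) \in \Omega] \le \exp(\epsilon) \Pr[A(\mathcal{D}') \in \Omega] + \delta.$$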

The Laplace distribution centered at $0$ with scale $b$ is the distribution with probability density function $\frac{1}{2b} e^{-|x|/b}$. We also make use of the exponential distribution with parameter $\eta$, which has density function $\eta e^{-\eta x}$ if $x \ge 0$ and $0$ otherwise.
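
To make the two noise distributions concrete, here is a minimal sketch of their densities, together with a numerical check of the density-ratio property that the Laplace mechanism relies on: shifting the argument by `shift` changes the Laplace density by a factor of at most $\exp(\text{shift}/b)$. The function names are hypothetical, and the exponential density assumes the rate parametrization $\eta e^{-\eta x}$, which matches the $\sqrt{2d}/\eta$ term that appears in the utility analysis of Section 4.

```python
import math


def laplace_pdf(x, b):
    """Density of the Laplace distribution centered at 0 with scale b."""
    return math.exp(-abs(x) / b) / (2.0 * b)


def exponential_pdf(x, eta):
    """Density of the exponential distribution with rate eta (assumed parametrization)."""
    return eta * math.exp(-eta * x) if x >= 0 else 0.0


def max_laplace_ratio(shift, b, xs):
    """Largest ratio p(x) / p(x + shift) over the evaluation points xs."""
    return max(laplace_pdf(x, b) / laplace_pdf(x + shift, b) for x in xs)


b = 2.0
shift = 1.0
xs = [i / 10.0 for i in range(-100, 101)]
worst = max_laplace_ratio(shift, b, xs)
```

With scale $b = \lambda/\epsilon$ and a shift of $\ell_1$ size at most $\lambda$, the same bound yields the familiar $e^{\epsilon}$ ratio.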

## 3 Objective perturbation over a discrete decision space

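In this section we give an objective perturbation algorithm that is $(\epsilon, \delta)$-differentially private for any non-convex Lipschitz objective over a discrete decision space $\mathcal{W}_\tau$. We assume that each $l \in \mathcal{D}$ is $G$-Lipschitz over $\mathcal{W}_\tau$ w.r.t. the $\ell_2$ norm: that is, for any $w, w' \in \mathcal{W}_\tau$, $|l(w) - l(w')| \le G \|w - w'\|_2$. Note that if $l$ takes values in $[0, 1]$, then we know $l$ is also $1/\tau$-Lipschitz due to the $\tau$-separation in $\mathcal{W}_\tau$.

Let $D$ be a bound on the maximum $\ell_2$ norm of any vector in $\mathcal{W}_\tau$. We will make use of a projection onto the unit sphere in one higher dimension. The projection function $\pi: \mathbb{R}^d \to \mathbb{R}^{d+1}$ is defined as:

$$\pi(w) = \left(w_1, \ldots, w_d, D\sqrt{1 - \|w\|_2^2 / D^2}\right) \frac{1}{D}$$

Note that $\|\pi(w)\|_2 = 1$ for all $w \in \mathcal{W}_\tau$, and also that for any $w, w' \in \mathcal{W}_\tau$, $\|\pi(w) - \pi(w')\|_2 \ge \frac{1}{D}\|w - w'\|_2$. This shows that while projecting to the unit sphere in $\mathbb{R}^{d+1}$, $\pi$ can’t force points too much closer together than they start, which will be useful in the privacy proof.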
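
The two properties of the projection noted above are easy to check numerically. The sketch below is an illustration only: the function name `project`, the norm bound, and the test points are assumptions chosen for the example.

```python
import itertools
import math


def project(w, D):
    """pi(w): rescale w by 1/D and append one coordinate so the result has unit norm."""
    norm_sq = sum(x * x for x in w)
    if norm_sq > D * D * (1.0 + 1e-12):
        raise ValueError("project requires ||w||_2 <= D")
    last = math.sqrt(max(0.0, 1.0 - norm_sq / (D * D)))
    return tuple(x / D for x in w) + (last,)


D = 2.0
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (-1.0, 1.0), (1.2, -1.6)]

# Property 1: every projected point lies on the unit sphere.
norms = [math.sqrt(sum(x * x for x in project(w, D))) for w in points]

# Property 2: distances shrink by at most a factor of D, so each gap is >= 0.
gaps = [
    math.dist(project(u, D), project(v, D)) - math.dist(u, v) / D
    for u, v in itertools.combinations(points, 2)
]
```

The second property holds because the first $d$ coordinates of $\pi(u) - \pi(v)$ are exactly $(u - v)/D$, and the extra coordinate can only add to the distance.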

We first prove an accuracy bound for OPDisc, which follows from a simple tail bound on the random linear perturbation term.

###### Theorem 1 (Utility).

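Algorithm 1 is an $(\alpha, \beta)$-minimizer for $\mathcal{W}_\tau$ with

$$\alpha = \frac{2\sqrt{2(d+1)\ln(4/\beta)}\,\sigma}{n} = \frac{14 G D^2 \sqrt{2(d+1)\ln(4/\beta)\ln(1/\delta)}}{n \tau \epsilon}$$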
###### Proof.

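For $X \sim N(0, s^2)$ and any $t \ge 0$ we have the following tail bound:

$$\Pr[X \ge t] \le \exp\left(-\frac{t^2}{2 s^2}\right)$$

Now let $Y = \sum_{i=1}^{d+1} \eta_i$ where each $\eta_i$ is an independent Gaussian random variable with variance $\sigma^2$. It follows that $Y \sim N(0, (d+1)\sigma^2)$. With probability $1 - \beta/2$ we have:

$$\sum_{i=1}^{d+1} \eta_i \le \sqrt{2(d+1)\ln(2/\beta)}\,\sigma = C$$

Thus with probability $1 - \beta/2$,

$$\sup_{w \in \mathcal{W}_\tau} \langle \eta, \pi(w) \rangle \le \sup_{\eta/C \in \Delta_{d+1},\ \|y\|_2 = 1} \langle y, \eta \rangle = \sup_{\|y\|_2 = 1} C \|y\|_\infty \le C = \sqrt{2(d+1)\ln(2/\beta)}\,\sigma$$

By symmetry, $\inf_{w \in \mathcal{W}_\tau} \langle \eta, \pi(w) \rangle \ge -C$ with probability $1 - \beta/2$. Thus by a union bound, $\sup_{w \in \mathcal{W}_\tau} |\langle \eta, \pi(w) \rangle| \le C$ with probability $1 - \beta$.

Let $w$ be the output of Algorithm 1 and $\hat{w}$ be the minimizer for $L(\mathcal{D}, \cdot)$. Then, since $\ln(2/\beta) \le \ln(4/\beta)$, with probability $1 - \beta$: $\frac{1}{n} L(\mathcal{D}, w) \le \bar{L}^{\pi}(\mathcal{D}, w, \eta) + \frac{\sqrt{2(d+1)\ln(4/\beta)}\,\sigma}{n}$ and $\bar{L}^{\pi}(\mathcal{D}, \hat{w}, \eta) \le \frac{1}{n} L(\mathcal{D}, \hat{w}) + \frac{\sqrt{2(d+1)\ln(4/\beta)}\,\sigma}{n}$. Combining these two bounds we get: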

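$$\begin{aligned}
\frac{1}{n} L(\mathcal{D}, w) &\le \bar{L}^{\pi}(\mathcal{D}, w, \eta) + \frac{\sqrt{2(d+1)\ln(4/\beta)}\,\sigma}{n} \\
&\le \bar{L}^{\pi}(\mathcal{D}, \hat{w}, \eta) + \frac{\sqrt{2(d+1)\ln(4/\beta)}\,\sigma}{n} \\
&\le \frac{L(\mathcal{D}, \hat{w})}{n} + \frac{2\sqrt{2(d+1)\ln(4/\beta)}\,\sigma}{n}
\end{aligned} \tag{1}$$

The second inequality holds because $w$ is the minimizer of the perturbed (regularized) loss $\bar{L}^{\pi}(\mathcal{D}, \cdot, \eta)$. ∎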
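
The mechanism analyzed above can be sketched in a few lines. This is a hedged illustration rather than the paper's Algorithm 1, which is not reproduced in this section: it assumes the algorithm draws a Gaussian vector $\eta \in \mathbb{R}^{d+1}$ and returns the exact minimizer of $L(\mathcal{D}, w) - \langle \eta, \pi(w) \rangle$, it uses the noise scale $\sigma = 7GD^2\sqrt{\ln(1/\delta)}/(\tau\epsilon)$ inferred from the bound in Theorem 1, and it replaces the optimization oracle with brute-force enumeration. The names `opdisc_sketch` and `project` and the toy losses are hypothetical.

```python
import math
import random


def project(w, D):
    """pi(w): rescale by 1/D and append a coordinate giving unit norm."""
    norm_sq = sum(x * x for x in w)
    last = math.sqrt(max(0.0, 1.0 - norm_sq / (D * D)))
    return tuple(x / D for x in w) + (last,)


def opdisc_sketch(losses, W_tau, D, G, tau, eps, delta, rng):
    """Objective perturbation over a finite parameter set W_tau (illustrative)."""
    # Assumption: noise scale inferred from the utility bound in Theorem 1.
    sigma = 7.0 * G * D * D * math.sqrt(math.log(1.0 / delta)) / (tau * eps)
    d = len(W_tau[0])
    eta = [rng.gauss(0.0, sigma) for _ in range(d + 1)]

    def perturbed_loss(w):
        linear = sum(e * p for e, p in zip(eta, project(w, D)))
        return sum(l(w) for l in losses) - linear

    # Brute-force stand-in for the projected linear optimization oracle.
    return min(W_tau, key=perturbed_loss)


# Toy data: bounded, 1-Lipschitz losses on a one-dimensional grid.
xs = [0.5, 0.5, 0.5]
losses = [lambda w, x=x: min(1.0, abs(w[0] - x)) for x in xs]
W_tau = [(-1.0,), (-0.5,), (0.0,), (0.5,), (1.0,)]

private_w = opdisc_sketch(losses, W_tau, D=1.0, G=1.0, tau=0.5,
                          eps=1.0, delta=1e-6, rng=random.Random(0))
# With a huge epsilon the noise is negligible and the true minimizer is returned.
nearly_exact_w = opdisc_sketch(losses, W_tau, D=1.0, G=1.0, tau=0.5,
                               eps=1e9, delta=1e-6, rng=random.Random(0))
```

On a dataset this small the noise dominates at $\epsilon = 1$, which is consistent with the $1/n$ dependence of the accuracy bound; in practice the enumeration would be replaced by an integer program solver.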

We now prove OPDisc preserves $(\epsilon, \delta)$-DP. We defer the full proof to the Appendix.

###### Theorem 2.

Algorithm 1 is $(\epsilon, \delta)$-differentially private.

###### Proof Sketch.

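For any realized noise vector $\eta$, we write $\hat{w} = \arg\min_{w \in \mathcal{W}_\tau} L(\mathcal{D}, w) - \langle \eta, \pi(w) \rangle$ as the output. We first want to show that there exists a mapping $g_{\hat{w}}: \mathbb{R}^{d+1} \to \mathbb{R}^{d+1}$ such that $\hat{w}$ is the parameter vector output on any neighboring dataset $\mathcal{D}'$ when the noise vector is realized as $g_{\hat{w}}(\eta)$: that is, $\hat{w} = \arg\min_{w \in \mathcal{W}_\tau} L(\mathcal{D}', w) - \langle g_{\hat{w}}(\eta), \pi(w) \rangle$. If we can show that the densities of $\eta$ and $g_{\hat{w}}(\eta)$ are close, then the probability of outputting any particular $\hat{w}$ on input $\mathcal{D}$ should be close to the corresponding probability on input $\mathcal{D}'$, as desired.

Consider the set of noise vectors that induce output $\hat{w}$ on dataset $\mathcal{D}$. Define our mapping: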

 g^w(η)=η+2τGD2π(^w)

We now state key lemmas. First, Lemma 3 shows that our mapping preserves the minimizer even after switching to the adjacent dataset $\mathcal{D}'$, so long as the minimizer is unique.

###### Lemma 3.

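Fix any $\hat{w} \in \mathcal{W}_\tau$ and any pair of neighboring datasets $\mathcal{D} \sim \mathcal{D}'$. Let $\eta$ be such that $\hat{w}$ is the unique minimizer $\hat{w} = \arg\min_{w \in \mathcal{W}_\tau} L(\mathcal{D}, w) - \langle \eta, \pi(w) \rangle$. Then $\hat{w} = \arg\min_{w \in \mathcal{W}_\tau} L(\mathcal{D}', w) - \langle g_{\hat{w}}(\eta), \pi(w) \rangle$. Hence every noise vector $\eta$ that induces output $\hat{w}$ on $\mathcal{D}$ is mapped by $g_{\hat{w}}$ to a noise vector that induces output $\hat{w}$ on $\mathcal{D}'$.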

###### Proof.

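Let $c$ denote the coefficient of $\pi(\hat{w})$ in the mapping, so that $g_{\hat{w}}(\eta) = \eta + c\,\pi(\hat{w})$. Suppose that $v \neq \hat{w}$ is the output on neighboring dataset $\mathcal{D}'$ when the noise vector is $g_{\hat{w}}(\eta)$. We will derive a contradiction. Since $\hat{w}$ is the unique minimizer on $\mathcal{D}$:

$$L(\mathcal{D}, \hat{w}) - \langle \eta, \pi(\hat{w}) \rangle < L(\mathcal{D}, v) - \langle \eta, \pi(v) \rangle \tag{2}$$

Let $i$ be the index where $\mathcal{D}$ and $\mathcal{D}'$ are different, such that $l_i \in \mathcal{D}$ and $l_i' \in \mathcal{D}'$. Then $L(\mathcal{D}', w) = L(\mathcal{D}, w) - l_i(w) + l_i'(w)$ for every $w$. Since $v$ minimizes the perturbed loss on $\mathcal{D}'$ with noise $g_{\hat{w}}(\eta)$, writing the loss function $L(\mathcal{D}', \cdot)$ in terms of $L(\mathcal{D}, \cdot)$ and rearranging terms gives:

$$\Big[\big(L(\mathcal{D}, v) - \langle \eta, \pi(v) \rangle\big) - \big(L(\mathcal{D}, \hat{w}) - \langle \eta, \pi(\hat{w}) \rangle\big)\Big] + (l_i(\hat{w}) - l_i(v)) - (l_i'(\hat{w}) - l_i'(v)) + \langle c\,\pi(\hat{w}), \pi(\hat{w}) - \pi(v) \rangle \le 0$$

Since $\hat{w}$ is the unique minimizer for $\mathcal{D}$ and $v \neq \hat{w}$, the term in the square bracket is positive by (2). Hence:

$$(l_i(\hat{w}) - l_i(v)) - (l_i'(\hat{w}) - l_i'(v)) + \langle c\,\pi(\hat{w}), \pi(\hat{w}) - \pi(v) \rangle < 0$$

Since $l_i, l_i'$ are $G$-Lipschitz functions, $(l_i(\hat{w}) - l_i(v)) - (l_i'(\hat{w}) - l_i'(v)) \ge -2G\|\hat{w} - v\|_2$. Also, $\langle \pi(\hat{w}), \pi(\hat{w}) - \pi(v) \rangle = \frac{1}{2}\|\pi(\hat{w}) - \pi(v)\|_2^2$, by expanding $\|\pi(\hat{w}) - \pi(v)\|_2^2$ and using $\|\pi(\hat{w})\|_2 = \|\pi(v)\|_2 = 1$. Substituting, this becomes:

$$-2G\|\hat{w} - v\|_2 + \frac{c}{2}\|\pi(\hat{w}) - \pi(v)\|_2^2 < 0$$

Since $\|\pi(\hat{w}) - \pi(v)\|_2 \ge \frac{1}{D}\|\hat{w} - v\|_2$:

$$\begin{aligned}
-2G\|\hat{w} - v\|_2 + \frac{c}{2D^2}\|\hat{w} - v\|_2^2 &< 0 \\
\frac{c}{2D^2}\|\hat{w} - v\|_2 &< 2G && \text{(divide both sides by } \|\hat{w} - v\|_2\text{)} \\
c\,\|\hat{w} - v\|_2 &< 4GD^2 \\
c\,\tau &< 4GD^2 && \text{(by assumption } \hat{w} \neq v\text{, hence } \|\hat{w} - v\|_2 \ge \tau\text{)} \\
c &< \frac{4GD^2}{\tau} && \text{(divide both sides by } \tau\text{)}
\end{aligned} \tag{3}$$

This contradicts the choice of $c$. ∎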
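
For completeness, the inner-product identity used in the proof can be checked in one line. This worked step is an added illustration; it only uses the fact that $\pi$ maps onto the unit sphere.

```latex
\begin{aligned}
\|\pi(\hat{w}) - \pi(v)\|_2^2
  &= \|\pi(\hat{w})\|_2^2 + \|\pi(v)\|_2^2 - 2\langle \pi(\hat{w}), \pi(v) \rangle
   = 2 - 2\langle \pi(\hat{w}), \pi(v) \rangle, \\
\langle \pi(\hat{w}), \pi(\hat{w}) - \pi(v) \rangle
  &= 1 - \langle \pi(\hat{w}), \pi(v) \rangle
   = \tfrac{1}{2}\,\|\pi(\hat{w}) - \pi(v)\|_2^2 .
\end{aligned}
```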

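Lemma 4 shows that the minimizer is unique with probability $1$.

###### Lemma 4.

Fix any $\tau$-separated vector space $\mathcal{W}_\tau$. For every dataset $\mathcal{D}$ there is a set of noise vectors of probability zero such that for any $\eta$ outside this set:

$$\exists \text{ a unique minimizer } \hat{w} \in \arg\min_{w \in \mathcal{W}_\tau} L(\mathcal{D}, w) - \langle \eta, \pi(w) \rangle$$

Finally, Lemma 5 shows that with high probability over the draw of $\eta$, the density ratio $p(\eta)/p(g_{\hat{w}}(\eta))$ is at most $e^{\epsilon}$.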

###### Lemma 5.

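Let $\eta$ be the Gaussian noise vector drawn by Algorithm 1. Then there exists a set of noise vectors with probability mass at least $1 - \delta$ such that for all $r$ in this set, if $p$ denotes the probability density function of $\eta$:

$$\frac{p(r)}{p(g_{w}(r))} \le e^{\epsilon}$$

Finally, we focus on noise vectors in the set given by Lemma 5 (excluding the probability-zero set of Lemma 4), which has probability mass at least $1 - \delta$, and show that for any $\eta$ in this set that induces output solution $\hat{w}$ on $\mathcal{D}$, the noise vector $g_{\hat{w}}(\eta)$ also induces $\hat{w}$ on the neighbor $\mathcal{D}'$. Then the $(\epsilon, \delta)$-differential privacy guarantee essentially follows from the bounded ratio result in Lemma 5. ∎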

### 3.1 Comparing OPDisc and RSPM

While both OPDisc and the RSPM algorithm of [neel2018use] require discrete parameter spaces, OPDisc is substantially more general in that it only requires the loss functions be Lipschitz, whereas RSPM assumes the loss functions are bounded in $[0, 1]$ (and hence Lipschitz over $\mathcal{W}_\tau$) and assumes the existence of a small separator set (defined in the supplement). Nevertheless, we might hope that in addition to greater generality, OPDisc has comparable or superior accuracy for natural classes of learning problems. We show this is indeed the case for the fundamental task of privately learning discrete hyperplanes, where it is better by a linear factor in the dimension. We define the RSPM algorithm, for which we must define the notion of a separator set, in the supplement.

###### Theorem 6 (RSPM Utility [neel2018use]).

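Let $\mathcal{W}$ be a discrete parameter space with a separator set of size $m$. The Gaussian RSPM algorithm is an oracle-efficient $(\alpha, \beta)$-minimizer for $\mathcal{W}$ for:

$$\alpha = O\left(\frac{m\sqrt{m \ln(2m/\beta)\ln(1/\delta)}}{\epsilon n}\right)$$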

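Consider a discretization of the parameter domain in $\mathbb{R}^d$ with spacing $\tau$, and let $\mathcal{W}_\tau$ be the subset of vectors in this discretization that lie within the unit Euclidean ball. $\mathcal{W}_\tau$ is $\tau$-separated since any two distinct $w, w' \in \mathcal{W}_\tau$ differ in at least one coordinate by at least $\tau$. Moreover $\mathcal{W}_\tau$ admits a separator set of size $m = O(d/\tau)$ (see the Appendix of [neel2018use]). Since the loss functions take values in $[0, 1]$ and $\mathcal{W}_\tau$ is $\tau$-separated, the loss functions are $1/\tau$-Lipschitz. By Theorem 6, RSPM has accuracy bound:

$$\alpha_{RSPM} = O\left(\frac{d\sqrt{d \log(d/\beta\tau)\log(1/\delta)}}{\tau\sqrt{\tau}\,\epsilon n}\right)$$

By Theorem 1 OPDisc has accuracy bound:

$$\alpha_{OPDisc} = O\left(\frac{\sqrt{d \log(1/\beta)\log(1/\delta)}}{n \tau^2 \epsilon}\right)$$

Thus, in this case OPDisc has an accuracy bound that is better by a factor of roughly $d\sqrt{\tau}$, ignoring logarithmic factors.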
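
The comparison can be spelled out as a short worked calculation. This is an added illustration under the assumptions stated above ($D = 1$ for the unit ball and $G = 1/\tau$), with logarithmic factors dropped in the final ratio.

```latex
\begin{aligned}
\alpha_{OPDisc}
  &= \frac{14\,G D^2 \sqrt{2(d+1)\ln(4/\beta)\ln(1/\delta)}}{n \tau \epsilon}
   \;\overset{G = 1/\tau,\ D = 1}{=}\;
     \frac{14 \sqrt{2(d+1)\ln(4/\beta)\ln(1/\delta)}}{n \tau^{2} \epsilon}, \\
\frac{\alpha_{RSPM}}{\alpha_{OPDisc}}
  &\approx \frac{d\sqrt{d}\,/\,(\tau\sqrt{\tau}\,\epsilon n)}{\sqrt{d}\,/\,(n \tau^{2} \epsilon)}
   = d\,\sqrt{\tau} .
\end{aligned}
```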

## 4 Objective perturbation for Lipschitz functions

We now present an objective perturbation algorithm (paired with an additional output perturbation step), which applies to arbitrary parameter spaces. The privacy guarantee holds for (possibly non-convex) Lipschitz loss functions, while the accuracy guarantee applies only if the loss functions are convex and bounded. Even in the convex case, this is a substantially more general statement than was previously known for objective perturbation: we don’t require any second order conditions like strong convexity or smoothness (or even differentiability). Our guarantees also hold with access only to an $\alpha$-approximate optimization oracle.

We present the full algorithm in Algorithm 2. It 1) uses the approximate linear oracle (in Definition 2.1) to solve polynomially many perturbed optimization objectives, each with an independent random perturbation, and 2) perturbs the average of these solutions with Laplace noise.
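
The two stages just described can be sketched as follows. This is a hedged illustration, not a reproduction of Algorithm 2: the brute-force `grid_oracle` is a hypothetical stand-in for the approximate linear optimization oracle, the perturbation coordinates are assumed to be i.i.d. exponential with rate `eta`, the output noise is assumed to be Laplace with scale `lam / eps` per coordinate, and `sensitivity_bound` simply evaluates the high-probability $\ell_1$-sensitivity expression stated later in this section. All parameter values are arbitrary.

```python
import random


def grid_oracle(losses, sigma, grid):
    """Hypothetical oracle: minimize L(D, w) - <sigma, w> over a finite grid."""
    def perturbed(w):
        return sum(l(w) for l in losses) - sum(s * x for s, x in zip(sigma, w))
    return min(grid, key=perturbed)


def sample_laplace(scale, rng):
    """The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sensitivity_bound(D_inf, gamma, eta, G, d, alpha):
    """lambda = 4 * D_inf * gamma + 250 * eta * G * d^2 * D_inf^2 + alpha / (10 * G)."""
    return 4 * D_inf * gamma + 250 * eta * G * d**2 * D_inf**2 + alpha / (10 * G)


def sampling_objective_perturbation(losses, grid, m, eta, lam, eps, rng):
    d = len(grid[0])
    total = [0.0] * d
    for _ in range(m):
        # Stage 1: solve one perturbed problem with fresh exponential noise.
        sigma = [rng.expovariate(eta) for _ in range(d)]
        w_k = grid_oracle(losses, sigma, grid)
        total = [t + x for t, x in zip(total, w_k)]
    average = [t / m for t in total]
    # Stage 2: output perturbation of the averaged solution.
    return [a + sample_laplace(lam / eps, rng) for a in average]


# Toy convex problem: absolute-deviation losses on a one-dimensional grid.
data = [0.2, 0.4, 0.4, 0.6]
losses = [lambda w, x=x: abs(w[0] - x) for x in data]
grid = [(k / 10.0,) for k in range(-10, 11)]
lam = sensitivity_bound(D_inf=1.0, gamma=0.1, eta=0.001, G=1.0, d=2, alpha=0.0)
out = sampling_objective_perturbation(losses, grid, m=20, eta=0.5,
                                      lam=1e-9, eps=1.0, rng=random.Random(1))
```

The call above passes a tiny `lam` only to keep the toy output readable; a private run would use the value returned by `sensitivity_bound` for the actual problem parameters.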

Before we proceed to our analysis, let us first introduce some relevant parameters. Let $\mathcal{W}$ have $\ell_2$ diameter $D_2$, and $\ell_\infty$ diameter $D_\infty$. We assume that the loss functions are $G$-Lipschitz with respect to the $\ell_1$ norm, and assume the loss functions are scaled to take values in $[0, 1]$. Our utility analysis requires convexity in the loss functions, and essentially follows from the high-probability bounds on the linear perturbation terms in the first stage and the output perturbation in the second stage.

###### Theorem 7 (Utility).

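Assuming the loss functions are convex, Algorithm 2 is an $(\alpha', \beta)$-minimizer for $\mathcal{W}$ with

$$\alpha' = O\left(\frac{d^{5/4} G D_\infty \sqrt{D_2 \log(1/\beta)}}{\sqrt{\epsilon n}} + \frac{\alpha \log(1/\beta)}{\epsilon}\right)$$

where $\alpha$ is the approximation error of the oracle $O_\alpha$.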

###### Proof.

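Let $\mu$ denote the Laplace noise vector (with scale $\lambda/\epsilon$) added in the output perturbation step, and write $\bar{w} = \frac{1}{m}\sum_{k} w_k$ for the average of the oracle outputs. By the theorem in [svante] which gives upper tail bounds for the sum of independent exponential random variables, we can conclude that $\|\mu\|_1 \le (1 + \log(2/\beta))\lambda/\epsilon$ with probability $1 - \beta/2$.

Then by $G$-Lipschitzness with respect to the $\ell_1$ norm, with probability $1 - \beta/2$:

$$\frac{1}{n} L(\bar{w} + \mu, \mathcal{D}) \le \frac{1}{n} L(\bar{w}, \mathcal{D}) + G\|\mu\|_1 \le \frac{1}{n} L(\bar{w}, \mathcal{D}) + \frac{G(1 + \log(2/\beta))\lambda}{\epsilon}$$

We now focus on $\frac{1}{n} L(\bar{w}, \mathcal{D})$. By the convexity of the loss functions, we have:

$$\frac{1}{n} L\left(\frac{1}{m}\sum_{k=1}^{m} w_k, \mathcal{D}\right) \le \frac{1}{m}\sum_{k=1}^{m} \frac{1}{n} L(w_k, \mathcal{D})$$

Since each $\frac{1}{n} L(w_k, \mathcal{D})$ is bounded in $[0, 1]$ (since each $l \in [0, 1]$) and independent, by Hoeffding’s Inequality (see Appendix) with probability $1 - \beta/2$:

$$\frac{1}{m}\sum_{k=1}^{m} \frac{1}{n} L(w_k, \mathcal{D}) \le \mathbb{E}_{w^*}\left[\frac{1}{n} L(w^*, \mathcal{D})\right] + \gamma\sqrt{\log(4/\beta)/2}$$

So it suffices to show that $\mathbb{E}_{w^*}\left[\frac{1}{n} L(w^*, \mathcal{D})\right] - \min_{w \in \mathcal{W}} \frac{1}{n} L(w, \mathcal{D})$ is small, where $w^* = O_\alpha(\mathcal{D}, \sigma)$. Fix $\sigma$ and let $w = O_\alpha(\mathcal{D}, \sigma)$. Now by definition of $O_\alpha$, for any $\tilde{w} \in \mathcal{W}$, we have

$$\frac{1}{n} L(w, \mathcal{D}) - \frac{1}{n}\langle w, \sigma \rangle \le \frac{1}{n} L(\tilde{w}, \mathcal{D}) - \frac{1}{n}\langle \tilde{w}, \sigma \rangle + \alpha,$$

hence

$$\frac{1}{n} L(w, \mathcal{D}) - \frac{1}{n} L(\tilde{w}, \mathcal{D}) \le \frac{1}{n}\langle w - \tilde{w}, \sigma \rangle + \alpha$$

By the Cauchy-Schwarz inequality, $\langle w - \tilde{w}, \sigma \rangle \le D_2 \|\sigma\|_2$. Taking $\tilde{w}$ to be a minimizer of $L(\cdot, \mathcal{D})$ and taking expectations over $\sigma$, we get:

$$\mathbb{E}_{w^*}\left[\frac{1}{n} L(w^*, \mathcal{D})\right] - \min_{w \in \mathcal{W}} \frac{1}{n} L(w, \mathcal{D}) \le \frac{1}{n} D_2\, \mathbb{E}[\|\sigma\|_2] + \alpha$$

Now by Jensen’s inequality, $\mathbb{E}[\|\sigma\|_2] \le \sqrt{\mathbb{E}[\|\sigma\|_2^2]} = \frac{\sqrt{2d}}{\eta}$, where the last equality is by the second moment of the exponential distribution. Putting it all together, with probability $1 - \beta$:

$$\frac{1}{n} L(\bar{w} + \mu, \mathcal{D}) - \min_{w \in \mathcal{W}} \frac{1}{n} L(w, \mathcal{D}) \le \frac{G(1 + \log(2/\beta))\lambda}{\epsilon} + \gamma\sqrt{\log(4/\beta)/2} + \alpha + \frac{1}{n}\frac{D_2\sqrt{2d}}{\eta}$$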

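Plugging in the value of $\lambda$, and expanding we get the following long expression:

$$\begin{aligned}
&\frac{G(1 + \log(2/\beta))\lambda}{\epsilon} + \gamma\sqrt{\log(4/\beta)/2} + \alpha + \frac{1}{n}\frac{D_2\sqrt{2d}}{\eta} \\
&= \frac{G(1 + \log(2/\beta))\left(4 D_\infty \gamma + 250\,\eta G d^2 D_\infty^2 + \frac{\alpha}{10G}\right)}{\epsilon} + \gamma\sqrt{\log(4/\beta)/2} + \alpha + \frac{1}{n}\frac{D_2\sqrt{2d}}{\eta} \\
&= \gamma\left(\frac{4 G D_\infty (1 + \log(2/\beta))}{\epsilon}\right) + \eta\left(\frac{250\, G^2 d^2 D_\infty^2 (1 + \log(2/\beta))}{\epsilon}\right) + \alpha\left(\frac{1 + \log(2/\beta)}{10\epsilon}\right) + \gamma\sqrt{\log(4/\beta)/2} + \frac{1}{\eta}\left(\frac{1}{n} D_2 \sqrt{2d}\right) + \alpha \\
&= \gamma A + \eta B + \alpha C + \gamma D + \frac{E}{\eta} + \alpha && \text{(setting placeholders } A, B, C, D, E\text{)} \\
&= \gamma(A + D) + 2\sqrt{BE} + \alpha(C + 1) && \left(\eta = \sqrt{E/B}\right)
\end{aligned} \tag{4}$$

The last step of equation 4 comes from substituting the value $\eta = \sqrt{E/B}$, for which $\eta B + E/\eta = 2\sqrt{BE}$. Substituting back the values of $A, B, C, D, E$ results in:

$$= \gamma\left(\frac{4 G D_\infty (1 + \log(2/\beta))}{\epsilon} + \sqrt{\log(4/\beta)/2}\right) + 2\sqrt{\frac{250\, G^2 d^2 D_\infty^2 D_2 \sqrt{2d}\,(1 + \log(2/\beta))}{\epsilon n}} + \alpha\left(\frac{1 + \log(2/\beta)}{10\epsilon} + 1\right)$$

Finally, note that by the choice of the parameter $\gamma$, the first term has order at most that of the second term, which gives our stated bound. ∎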
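
The choice of $\eta$ in the last step of equation 4 is the usual balancing argument. As an added worked step, using the placeholders $B$ and $E$ from that equation (both positive), the arithmetic-geometric mean inequality gives the minimizing value directly; constant factors are absorbed by the $O(\cdot)$ in the stated bound.

```latex
\eta B + \frac{E}{\eta} \;\ge\; 2\sqrt{\eta B \cdot \frac{E}{\eta}} \;=\; 2\sqrt{BE},
\qquad \text{with equality iff } \eta B = \frac{E}{\eta}
\iff \eta = \sqrt{E/B}.
```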

The privacy analysis of this algorithm crucially depends on a stability lemma proven by [suggala2019online] in the context of online learning, and does not require convexity. (Compared to the bound in [suggala2019online], our bound has an additional factor of 2 since our neighboring relationship in Definition 2.5 is defined via replacement, whereas in [suggala2019online] the stability is defined in terms of adding another loss function.)

###### Lemma 8 (Stability lemma [suggala2019online]).

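For any pair of neighboring data sets $\mathcal{D} \sim \mathcal{D}'$, let $O_\alpha(\mathcal{D}, \sigma)$ and $O_\alpha(\mathcal{D}', \sigma)$ be the output of an $\alpha$-approximate oracle on datasets $\mathcal{D}$ and $\mathcal{D}'$ respectively. Then,

$$\mathbb{E}_\sigma\left[\|O_\alpha(\mathcal{D}, \sigma) - O_\alpha(\mathcal{D}', \sigma)\|_1\right] \le 250\,\eta G d^2 D_\infty^2 + \frac{\alpha}{10G}$$

From now on, let $\Sigma = (\sigma_1, \ldots, \sigma_m)$ be a sequence of $m$ i.i.d. $d$-dimensional noise vectors, and let $W(\mathcal{D}, \Sigma) = \frac{1}{m}\sum_{k=1}^{m} O_\alpha(\mathcal{D}, \sigma_k)$ be the average output of $m$ calls to an $\alpha$-approximate oracle.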

###### Lemma 9.

If , for , then, with probability :

$$\|W(\mathcal{D}, \Sigma) - \mathbb{E}_\sigma[O_\alpha(\mathcal{D}, \sigma)]\|_1 \le 2 D_\infty \gamma$$

where the randomness is taken over the different runs of $O_\alpha$.

The next lemma combines Lemma 8 and Lemma 9 to get high probability sensitivity bound for the average output of the approximate oracle.

###### Lemma 10 (High Probability ℓ1-sensitivity).

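For any pair of neighboring datasets $\mathcal{D} \sim \mathcal{D}'$, let $W(\mathcal{D}, \Sigma)$, $W(\mathcal{D}', \Sigma)$ be the sample average after $m$ calls to an $\alpha$-approximate oracle. Then, with probability $1 - \delta$ over the random draws of $\Sigma$,

$$\|W(\mathcal{D}, \Sigma) - W(\mathcal{D}', \Sigma)\|_1 \le 4 D_\infty \gamma + 250\,\eta G d^2 D_\infty^2 + \frac{\alpha}{10G} \tag{5}$$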
###### Proof.

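By Lemma 9, if we run the approximate oracle $m$ times on each neighboring dataset $\mathcal{D}, \mathcal{D}'$, then by a union bound we get that with probability $1 - \delta$:

$$\|W(\mathcal{D}, \Sigma) - \mathbb{E}[O_\alpha(\mathcal{D}, \sigma)]\|_1 \le 2 D_\infty \gamma \quad \text{and} \quad \|W(\mathcal{D}', \Sigma) - \mathbb{E}[O_\alpha(\mathcal{D}', \sigma)]\|_1 \le 2 D_\infty \gamma$$

Adding both inequalities and applying the triangle inequality,

$$\|W(\mathcal{D}, \Sigma) - W(\mathcal{D}', \Sigma)\|_1 \le 4 D_\infty \gamma + \|\mathbb{E}[O_\alpha(\mathcal{D}, \sigma)] - \mathbb{E}[O_\alpha(\mathcal{D}', \sigma)]\|_1 \tag{6}$$

Lastly, by Lemma 8 (together with $\|\mathbb{E}[X]\|_1 \le \mathbb{E}[\|X\|_1]$),

$$\|W(\mathcal{D}, \Sigma) - W(\mathcal{D}', \Sigma)\|_1 \le 4 D_\infty \gamma + 250\,\eta G d^2 D_\infty^2 + \frac{\alpha}{10G}$$

∎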

###### Theorem 11.

Algorithm 2 is $(\epsilon, \delta)$-differentially private.

###### Proof sketch.

Given a pair of neighboring data sets $\mathcal{D} \sim \mathcal{D}'$, we condition on the event that the noise vectors $\Sigma$ satisfy the $\ell_1$-sensitivity bound (5), which occurs with probability at least $1 - \delta$. Then the privacy guarantee follows from the use of the Laplace mechanism. ∎

###### Proof.

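Fix any two neighboring datasets $\mathcal{D} \sim \mathcal{D}'$. Recall that $W(\mathcal{D}, \Sigma)$ is the average of $m$ runs of $O_\alpha$ with dataset $\mathcal{D}$ and sequence of noise vectors $\Sigma$. Let $\mu$ be a random $d$-dimensional Laplace noise vector with scale $\lambda/\epsilon$ in each coordinate. We can write the output of Algorithm 2 as a sum of two random variables:

$$M(\mathcal{D}, \Sigma, \mu) = W(\mathcal{D}, \Sigma) + \mu$$

Following Lemma 8, let $\lambda = 4 D_\infty \gamma + 250\,\eta G d^2 D_\infty^2 + \frac{\alpha}{10G}$ and define the set $B$ as

$$B = \left\{\Sigma \in \mathbb{R}^{m \times d} : \|W(\mathcal{D}, \Sigma) - W(\mathcal{D}', \Sigma)\|_1 \le \lambda\right\}$$

where $\lambda$ is the $\ell_1$-norm sensitivity bound from Lemma 10. Then, by the same lemma, the probability that $\Sigma \notin B$ is less than $\delta$, where $\Sigma$ is sampled independently from the exponential distribution. For any event $S$,

$$\begin{aligned}
\Pr[M(\mathcal{D}, \Sigma, \mu) \in S] &= \Pr[M(\mathcal{D}, \Sigma, \mu) \in S \cap \Sigma \in B] + \Pr[M(\mathcal{D}, \Sigma, \mu) \in S \cap \Sigma \notin B] \\
&\le \Pr[M(\mathcal{D}, \Sigma, \mu) \in S \cap \Sigma \in B] + \Pr[\Sigma \notin B] \\
&\le \Pr[M(\mathcal{D}, \Sigma, \mu) \in S \cap \Sigma \in B] + \delta
\end{aligned} \tag{7}$$

We can rewrite the joint probability as a conditional probability:

$$\Pr[M(\mathcal{D}, \Sigma, \mu) \in S \cap \Sigma \in B] = \Pr[M(\mathcal{D}, \Sigma, \mu) \in S \mid \Sigma \in B]\,\Pr[\Sigma \in B] \le \Pr[M(\mathcal{D}, \Sigma, \mu) \in S \mid \Sigma \in B] \tag{8}$$

For every fixed $\Sigma \in B$, the two averages $W(\mathcal{D}, \Sigma)$ and $W(\mathcal{D}', \Sigma)$ differ by at most $\lambda$ in $\ell_1$ norm, so the Laplace mechanism gives

$$\Pr_\mu[W(\mathcal{D}, \Sigma) + \mu \in S] \le e^{\epsilon} \Pr_\mu[W(\mathcal{D}', \Sigma) + \mu \in S] \tag{9}$$

Therefore, averaging (9) over $\Sigma \in B$ and combining with (7),

$$\Pr[M(\mathcal{D}, \Sigma, \mu) \in S] \le e^{\epsilon} \Pr[M(\mathcal{D}', \Sigma, \mu) \in S] + \delta.$$

∎
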

## 5 Experiments

For our experiments we consider the problem of privately learning a linear threshold function to solve a binary classification task. Given a labeled data set $\{(x_i, y_i)\}_{i=1}^{n}$ where each $x_i \in \mathbb{R}^d$ and $y_i$ is a binary label, the classification problem is to find a hyperplane that best separates the positive from the negative samples. A common approach is to optimize a convex surrogate loss function that approximates the classification loss. We use this approach (private logistic regression) as our baseline. In comparison, using our algorithm OPDisc, we instead try to directly optimize classification error over a discrete parameter space, using an integer program solver. Although this can be computationally expensive, we find that it is feasible for relatively small datasets (we use a balanced subset of the Adult dataset with a reduced set of features, after one-hot encodings of categorical features). In this setting, we find that OPDisc can substantially outperform private logistic regression. We remark that “small data” is the regime in which applying differential privacy is most challenging, and we view our approach as a promising way forward in this important setting.

#### Data description and pre-processing

We use the Adult dataset [Lichman:2013], a common benchmark dataset derived from Census data. The classification task is to predict whether an individual earns over 50K per year. The dataset has 14 features that are a mix of both categorical and continuous attributes. The Adult dataset is unbalanced: only 7841 individuals have the positive label. To arrive at a balanced dataset (so that constant functions achieve 50% error), we take all positive individuals, and an equal number of negative individuals selected at random, for a total dataset size of $2 \times 7841 = 15682$. We encode categorical features with one-hot encodings, which increases the dimensionality of the dataset. We found it difficult to run our algorithm with more than 30 features, and so we take a subset of 7 features from the Adult dataset, which are represented by real-valued features after one-hot encoding. We chose the subset of features to optimize the accuracy of our logistic regression baseline.

#### Baseline: private logistic regression (LR).

We use as our baseline private logistic regression, which optimizes over the space of continuous halfspaces with the goal of minimizing the logistic loss function, given by $\log(1 + \exp(-y \langle w, x \rangle))$ for an example $(x, y)$ with label $y \in \{-1, 1\}$. We implement a differentially private stochastic gradient descent (privateSGD) algorithm from [BST14, abadi2016deep], keeping track of privacy loss using the moments accountant method as implemented in the TensorFlow Privacy Library. The algorithm involves three parameters: gradient clip norm, mini-batch size, and learning rate. For each target pair of privacy parameters $(\epsilon, \delta)$, we run a grid search to identify the triplet of parameters that gives the highest accuracy. To lower the variance of the accuracy, we also take the average over all the iterates in the run of privateSGD.
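
To make the contrast between the surrogate and the target objective concrete, the following minimal sketch evaluates the logistic loss and the 0/1 classification loss of a linear classifier on a single example. It assumes labels in $\{-1, +1\}$ and counts a zero margin as an error; the function names are illustrative and this is not the paper's experimental code.

```python
import math


def margin(w, x, y):
    """Signed margin y * <w, x> of a linear classifier on one example."""
    return y * sum(wi * xi for wi, xi in zip(w, x))


def logistic_loss(w, x, y):
    """Convex surrogate log(1 + exp(-margin)), computed in a numerically stable way."""
    m = margin(w, x, y)
    if m > 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))


def zero_one_loss(w, x, y):
    """Classification error on one example (a zero margin counts as an error)."""
    return 0.0 if margin(w, x, y) > 0 else 1.0


w = (1.0, 0.0)
x = (2.0, 0.0)
correct = (logistic_loss(w, x, 1), zero_one_loss(w, x, 1))
wrong = (logistic_loss(w, x, -1), zero_one_loss(w, x, -1))
```

The logistic loss keeps changing with the size of the margin even when the prediction is already correct, whereas the 0/1 loss depends only on its sign; this is the quantity OPDisc optimizes directly.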

#### Implementation details for OPDisc and RSPM

For both OPDisc and RSPM, we encode each record as a classification loss function. Both algorithms use the same separation parameter $\tau$ and constrain the weight vectors to have bounded norm. In OPDisc, each coordinate $w_j$ can take values in a discrete set, and we constrain the norm of $w$. In RSPM, we optimize over a discrete set of weight vectors. OPDisc requires an approximate projected linear optimization oracle (Definition 2.2) and RSPM requires a linear optimization oracle (Definition 2.1). In the appendix, we show that the optimization problems can be cast as mixed-integer programs (MIPs), allowing us to implement the oracles via the Gurobi MIP solver. The Gurobi solver was able to solve each of the integer programs we passed it.

#### Empirical evaluation.

We evaluate our algorithms by their classification accuracy. The left side of Figure 0(a) plots the accuracy of OPDisc and our baseline (y-axis) as a function of the privacy parameter $\epsilon$ (x-axis), averaged over 15 runs. We fix $\delta$ to the same value for all three algorithms across all runs. The error bars report the empirical standard deviation. We see that both OPDisc and RSPM improve dramatically over the logistic regression baseline, showing that in small-data settings, it is possible to improve over the error/privacy tradeoff given by standard convex-surrogate approaches by appealing to non-convex optimization heuristics. OPDisc also obtains consistently better error than RSPM, and has significantly lower variance in its error than the other two algorithms. The right side of Figure 0(a) gives a histogram of the run-time of our three methods over the course of our experiment. For both OPDisc and RSPM, the running time is dominated by an integer-program solver. We see that while our method frequently completes quite quickly (often even beating our logistic regression baseline!), it has high variance, and occasionally requires a long time to run. In our experiments, however, we were always able to eventually solve the necessary optimization problem.

## Appendix A Definitions

###### Definition A.1 ([goldman1993exact, oracle16]).

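A set $U$ of loss functions is a separator set for a parameter space $\mathcal{W}$ if for every pair of distinct parameters $w, w' \in \mathcal{W}$, there is an $l \in U$ such that:

$$l(w) \neq l(w')$$

If $|U| = m$, then we say that $\mathcal{W}$ has a separator set of size $m$.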
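
For a finite parameter space the definition can be checked by brute force. The sketch below is an illustration with a hypothetical toy class (threshold parameters on a small integer grid); the function name `is_separator_set` is an assumption.

```python
from itertools import combinations


def is_separator_set(U, W):
    """True if every pair of distinct parameters in W is told apart by some loss in U."""
    return all(any(l(w) != l(v) for l in U) for w, v in combinations(W, 2))


# Toy class: parameters are thresholds 0..3, and the loss indexed by t is 1 iff w > t.
W = [0, 1, 2, 3]
U = [lambda w, t=t: 1 if w > t else 0 for t in (0, 1, 2)]

full_set_separates = is_separator_set(U, W)
# Dropping the loss with t = 2 leaves thresholds 2 and 3 indistinguishable.
subset_separates = is_separator_set(U[:2], W)
```

Here the three losses form a separator set of size 3 for the four thresholds, while the two-element subset does not.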

###### Definition A.2.

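A weighted optimization oracle for a class $\mathcal{W}$ is a function that takes as input a weighted dataset $WD = \{(l_i, p_i)\}$ and outputs $w \in \mathcal{W}$ such that

$$w \in \arg\min_{w^* \in \mathcal{W}} \sum_{(l_i, p_i) \in WD} p_i\, l_i(w^*)$$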

## Appendix B Missing Proofs in Section 3

Proof of Lemma 4.

###### Proof.

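Since $\mathcal{W}_\tau$ is a discrete space, by a union bound it suffices to show that for any pair of distinct $w, w' \in \mathcal{W}_\tau$, the probability over $\eta$ that $L(\mathcal{D}, w) - \langle \eta, \pi(w) \rangle = L(\mathcal{D}, w') - \langle \eta, \pi(w') \rangle$ is zero. Since $w \neq w'$, they must differ in at least one coordinate $i$. Condition on the realization of all of the coordinates of $\eta$ but the $i$-th. Then the two perturbed losses are equal only if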

 ηi=L(D,w′)−L(D,w)+∑j≠iηj(wj−wj