Generalized Gaussian Mechanism for Differential Privacy


Fang Liu¹

¹ Fang Liu is Associate Professor in the Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556 (E-mail: fang.liu.131@nd.edu). The work is supported by the NSF Grant 1546373 and the University of Notre Dame Faculty Research Support Program Initiation Grant.

Abstract

Assessment of disclosure risk is of paramount importance in the research and applications of data privacy techniques. The concept of differential privacy (DP) formalizes privacy in probabilistic terms and provides a robust concept for privacy protection without making assumptions about the background knowledge of adversaries. Practical applications of DP involve development of DP mechanisms to release results at a pre-specified privacy budget. In this paper, we generalize the widely used Laplace mechanism to the family of generalized Gaussian (GG) mechanisms based on the $l_p$ global sensitivity of statistical queries. We explore the theoretical requirement for the GG mechanism to reach DP at prespecified privacy parameters, and investigate the connections and differences between the GG mechanism and the Exponential mechanism based on the GG distribution. We also present a lower bound on the scale parameter of the Gaussian mechanism of $(\epsilon,\delta)$-probabilistic DP as a special case of the GG mechanism, and compare the statistical utility of the sanitized results in the tail probability and dispersion in the Gaussian and Laplace mechanisms. Lastly, we apply the GG mechanism in 3 experiments (the mildew, Czech, and adult data), compare the accuracy of sanitized results via the $l_1$ distance and Kullback-Leibler divergence, and examine how sanitization affects the prediction power of a classifier constructed with the sanitized data in the adult experiment.

Keywords: (probabilistic) differential privacy, global sensitivity, privacy budget, Laplace mechanism, Gaussian mechanism

1 Introduction

When releasing information publicly from a database or sharing data with collaborators, data collectors are always concerned about exposing sensitive personal information of individuals who contribute to the data. Even with key identifiers removed, data users may still identify a participant in a data set, for example via linkage with public information. Differential privacy (DP) provides a strong privacy guarantee to data release without making assumptions about the background knowledge or behavior of data users [1, 2, 3]. For a given privacy budget, information released via a differentially private mechanism guarantees that no additional personal information of an individual in the data can be inferred, regardless of how much background information data users already possess about the individual. DP has spurred a great amount of work in the development of differentially private mechanisms to release results and data, including the Laplace mechanism [1], the Exponential mechanism [4, 5], the median mechanism [6], the multiplicative weights mechanism [7], the geometric mechanism [8], the staircase mechanism [9], the Gaussian mechanism [10], and applications of DP for private and secure inference in a Bayesian setting [11], among others.

In this paper, we unify the Laplace mechanism and the Gaussian mechanism in the framework of a general family, referred to as the generalized Gaussian (GG) mechanism. The GG mechanism is based on the $l_p$ global sensitivity (GS) of queries, a generalization of the $l_1$ GS. We demonstrate the nonexistence of a scale parameter that would lead to a GG mechanism of pure $\epsilon$-DP in the case of $p\ge2$ if the results to be released are unbounded, but suggest the GG mechanism of $(\epsilon,\delta)$-probabilistic DP (pDP) as an alternative in such cases. For bounded data we introduce the truncated GG mechanism and the boundary inflated truncated GG mechanism that satisfy pure $\epsilon$-DP. We investigate the connections between the GG mechanism and the Exponential mechanism when the utility function in the latter is based on the Minkowski distance, and establish the relationship between the sensitivity of the utility function in the Exponential mechanism and the $l_p$ GS of queries. We then take a closer look at the Gaussian mechanism (the GG mechanism of order 2), and derive a lower bound on the scale parameter that delivers $(\epsilon,\delta)$-pDP. The bound is tighter than the bound to satisfy $(\epsilon,\delta)$-approximate DP (aDP) in the Gaussian mechanism [10], implying less noise being injected in the sanitized results. We compare the utility of sanitized results, in terms of the tail probability and dispersion or mean squared errors (MSE), from independent applications of the Gaussian mechanism and the Laplace mechanism. Finally, we run 3 experiments on the mildew, Czech, and adult data, respectively, and sanitize the count data via the Laplace mechanism and the Gaussian mechanisms of $(\epsilon,\delta)$-pDP and $(\epsilon,\delta)$-aDP. We compare the accuracy of sanitized results in terms of the $l_1$ distance and Kullback-Leibler divergence from the original results, and examine how sanitization affects the prediction accuracy of support vector machines constructed with the sanitized data in the adult experiment.

The rest of the paper is organized as follows. Section 2 defines the $l_p$ GS and presents the GG mechanism of $(\epsilon,\delta)$-pDP, the truncated GG mechanism, and the boundary inflated truncated GG mechanism that satisfy pure $\epsilon$-DP. It also connects and differentiates between the GG mechanisms and the Exponential mechanism when the utility function in the latter is based on the Minkowski distance. Section 3 takes a close look at the Gaussian mechanism of $(\epsilon,\delta)$-pDP, and compares it with the Gaussian mechanism of $(\epsilon,\delta)$-aDP. It also compares the tail probability and the dispersion of the noises injected via the Gaussian mechanism of $(\epsilon,\delta)$-pDP and the Laplace mechanism. Section 4 presents the findings from the 3 experiments. Concluding remarks are given in Section 5.

2 Generalized Gaussian Mechanism

2.1 differential privacy (DP)

DP was proposed and formulated in Dwork [12] and Dwork et al. [1]. A perturbation algorithm $\mathcal{R}$ gives $\epsilon$-differential privacy if for all data sets $(\mathbf{x},\mathbf{x}')$ that differ by only one individual ($d(\mathbf{x},\mathbf{x}')=1$), and all possible query results $Q\subseteq\mathcal{T}$ to query $\mathbf{s}$ ($\mathcal{T}$ denotes the output range of $\mathcal{R}$),

$$\left|\log\left(\frac{\Pr(\mathcal{R}(\mathbf{s}(\mathbf{x}))\in Q)}{\Pr(\mathcal{R}(\mathbf{s}(\mathbf{x}'))\in Q)}\right)\right|\le\epsilon, \tag{1}$$

where $\epsilon>0$ is the privacy budget parameter. $\mathbf{s}$ refers to queries about data $\mathbf{x}$ and $\mathbf{x}'$; we also use it to denote the query results (unless stated otherwise, the domain of the query results is the set of all real numbers). $d(\mathbf{x},\mathbf{x}')=1$ is often defined in two ways in the DP community: $\mathbf{x}$ and $\mathbf{x}'$ are of the same size and differ in exactly one record (row) in at least one attribute (column); or $\mathbf{x}'$ is exactly the same as $\mathbf{x}$ except that it has one less (more) record. Mathematically, Eqn (1) states that the probabilities of obtaining the same query result perturbed via $\mathcal{R}$ are roughly the same regardless of whether the query is sent to $\mathbf{x}$ or $\mathbf{x}'$. In layman's terms, DP implies the chance that an individual will be identified based on the perturbed query result is very low since the query result would be about the same with or without the individual in the data. The degree of "roughly the same" is determined by the privacy budget $\epsilon$. The lower $\epsilon$ is, the more similar the probabilities of obtaining the same query results from $\mathbf{x}$ and $\mathbf{x}'$ are. DP provides a strong and robust privacy guarantee in the sense that it does not assume anything regarding the background knowledge or the behavior of data users.

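In addition to the pure $\epsilon$-DP in Eqn (1), there are softer versions of DP, including the $(\epsilon,\delta)$-approximate DP (aDP) [13], the $(\epsilon,\delta)$-probabilistic DP (pDP) [14], the $(\epsilon,\delta)$-random DP (rDP) [15], and the $(\epsilon,\tau)$-concentrated DP (cDP) [16]. In all the relaxed versions of DP, one additional parameter is employed to characterize the amount of relaxation on top of the privacy budget $\epsilon$. Both the $(\epsilon,\delta)$-aDP and the $(\epsilon,\delta)$-pDP reduce to $\epsilon$-DP when $\delta=0$, but are different with respect to the interpretation of $\delta$. In $(\epsilon,\delta)$-aDP,

$$\Pr(\mathcal{R}(\mathbf{s}(\mathbf{x}))\in Q)\le e^{\epsilon}\Pr(\mathcal{R}(\mathbf{s}(\mathbf{x}'))\in Q)+\delta, \tag{2}$$

while a perturbation algorithm $\mathcal{R}$ satisfies $(\epsilon,\delta)$-pDP if

$$\Pr\left(\left|\log\left(\frac{\Pr(\mathcal{R}(\mathbf{s}(\mathbf{x}))\in Q)}{\Pr(\mathcal{R}(\mathbf{s}(\mathbf{x}'))\in Q)}\right)\right|>\epsilon\right)\le\delta, \tag{3}$$

that is, the probability of generating an output belonging to the disclosure set is bounded below $\delta$, where the disclosure set contains all the possible outputs that leak information for a given privacy budget $\epsilon$. The fact that probabilities are within $[0,1]$ puts constraints on the values of $\epsilon$, $\delta$, and $\Pr(\mathcal{R}(\mathbf{s}(\mathbf{x}'))\in Q)$ in the framework of $(\epsilon,\delta)$-aDP. By contrast, $(\epsilon,\delta)$-pDP seems to be less constrained and more intuitive with its probabilistic flavor. When $\delta$ is small, $(\epsilon,\delta)$-aDP and $(\epsilon,\delta)$-pDP are roughly the same. The $(\epsilon,\delta)$-rDP is also a probabilistic relaxation of DP, but it differs from $(\epsilon,\delta)$-pDP in that the probabilistic relaxation is with respect to data generation. In $(\epsilon,\tau)$-cDP, privacy cost is treated as a random variable with an expectation of $\epsilon$, and the probability of the actual cost exceeding $\epsilon+a$ is bounded by $e^{-(a/\tau)^2/2}$. The $(\epsilon,\tau)$-cDP, similar to the $(\epsilon,\delta)$-pDP, relaxes the satisfaction of DP with respect to the perturbation algorithm $\mathcal{R}$ and is broader in scope.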

2.2 $l_p$ global sensitivity

Definition 1.

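For all $(\mathbf{x},\mathbf{x}')$ with $d(\mathbf{x},\mathbf{x}')=1$, the $l_p$-global sensitivity (GS) of query $\mathbf{s}=(s_1,\ldots,s_r)$ is

$$\Delta_p=\max_{d(\mathbf{x},\mathbf{x}')=1}\|\mathbf{s}(\mathbf{x})-\mathbf{s}(\mathbf{x}')\|_p=\max_{d(\mathbf{x},\mathbf{x}')=1}\left(\sum_{k=1}^r|s_k(\mathbf{x})-s_k(\mathbf{x}')|^p\right)^{1/p}\quad\text{for integer }p>0. \tag{4}$$

In layman's terms, $\Delta_p$ is the maximum difference, measured by the Minkowski distance, in query results $\mathbf{s}$ between two neighboring data sets $\mathbf{x},\mathbf{x}'$ with $d(\mathbf{x},\mathbf{x}')=1$. The sensitivity is global since it is defined for all possible data sets and all possible ways that $\mathbf{x}$ and $\mathbf{x}'$ differ by one. The higher $\Delta_p$ is, the more disclosure risk there is on the individuals from releasing the original query results $\mathbf{s}$. The $l_p$ GS is a key concept in the construction of the generalized Gaussian mechanism in Section 2.

The $l_p$ GS is a generalization of the $l_1$ GS [12, 1] and the $l_2$ GS [10]. The difference between $\mathbf{s}(\mathbf{x})$ and $\mathbf{s}(\mathbf{x}')$ measured by $l_1$ is the largest among all $l_p$ for $p\ge1$, since $\|\mathbf{v}\|_p\ge\|\mathbf{v}\|_{p+a}$ for any real-valued vector $\mathbf{v}$ and $a\ge0$. In addition, $l_1$ is also the most sensitive measure given that the rate of change with respect to any $s_k$ is the largest among all $p\ge1$. When $\mathbf{s}$ is a scalar, $\Delta_p=\Delta_1$ for all $p\ge1$. When $\mathbf{s}$ is multi-dimensional, an easy upper bound for the $l_1$ GS is the sum of the $l_1$ GS of each element in $\mathbf{s}$, by the triangle inequality. Lemma 2 gives an upper bound on $\Delta_p$ for a general $p$ that includes $p=1$ as a special case (the proof is provided in Appendix A).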

Lemma 2.

$\left(\sum_{k=1}^r\Delta_{1,k}^p\right)^{1/p}$ is an upper bound for $\Delta_p$, where $\Delta_{1,k}$ is the $l_1$ GS of $s_k$.

The upper bound given in Lemma 2 can be conservative in cases where the change from $\mathbf{x}$ to $\mathbf{x}'$ does not necessarily alter every entry in the multidimensional $\mathbf{s}$. For example, the $l_1$ GS of releasing a histogram with $r$ bins is 1 (if $d(\mathbf{x},\mathbf{x}')=1$ is defined as $\mathbf{x}'$ having one record less/more than $\mathbf{x}$). In other words, the GS is not $r$ even though there are $r$ counts in the released histogram, but is the same as in releasing a single cell because removing one record only alters the count in a single bin.

It is obvious that each element $s_k$ in $\mathbf{s}$ for $k=1,\ldots,r$ needs to be bounded to obtain a finite $\Delta_p$. The most extreme case is when the change from $\mathbf{x}$ to $\mathbf{x}'$ makes $s_k$ jump from one extreme to the other, implying the range of $s_k$ can be used as an upper bound for $\Delta_{1,k}$, which, combined with Lemma 2, leads to the following claim.
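
As a small numerical illustration of the histogram example, the sketch below computes the Minkowski distance between two neighboring histograms and compares it with a sum-type upper bound of the form $(\sum_k\Delta_{1,k}^p)^{1/p}$. That form is an assumption about the bound in Lemma 2, and the function names and toy counts are hypothetical.

```python
def lp_distance(u, v, p):
    """Minkowski (l_p) distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def lp_gs_upper_bound(elementwise_gs, p):
    """Assumed Lemma 2 style bound: (sum_k Delta_{1,k}^p)^(1/p)."""
    return sum(d ** p for d in elementwise_gs) ** (1.0 / p)


# Toy histogram with 4 bins; x' has one record fewer than x (third bin).
hist_x = [3, 0, 5, 2]
hist_x_prime = [3, 0, 4, 2]

actual_change = lp_distance(hist_x, hist_x_prime, p=1)   # only one bin moves
conservative = lp_gs_upper_bound([1, 1, 1, 1], p=1)      # counts every bin
print(actual_change, conservative)
```

The gap between the two numbers is the conservativeness discussed above: the bound treats every bin as if it could change at once.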

Claim 3.

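Denote the bounds of statistic $s_k$ by $[c_{k0},c_{k1}]$, both of which are finite. The GS $\Delta_{1,k}\le c_{k1}-c_{k0}$, and the $l_p$ GS for $\mathbf{s}$ is $\Delta_p\le\left(\sum_{k=1}^r(c_{k1}-c_{k0})^p\right)^{1/p}$.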

2.3 generalized Gaussian distribution

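The GG mechanism is defined based on the GG distribution GG$(\mu,b,p)$ with location parameter $\mu$, scale parameter $b>0$, and shape parameter $p>0$. The probability density function (pdf) is

$$f(x\mid\mu,b,p)=\frac{p}{2b\,\Gamma(p^{-1})}\exp\left\{-\left(\frac{|x-\mu|}{b}\right)^p\right\}.$$

The mean and variance of $x$ are $\mu$ and $b^2\Gamma(3/p)/\Gamma(1/p)$, respectively ($\Gamma$ is the Gamma function). When $p=1$, the GG distribution is the Laplace distribution with mean $\mu$ and variance $2b^2$; when $p=2$, the GG distribution becomes the Gaussian distribution with mean $\mu$ and variance $b^2/2$.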
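
The density and variance above can be checked numerically. The following is a minimal sketch, assuming the standard GG parameterization $f(x)=\frac{p}{2b\Gamma(1/p)}\exp\{-(|x-\mu|/b)^p\}$; the function names are illustrative only.

```python
import math


def gg_pdf(x, mu, b, p):
    """Density of GG(mu, b, p) at x."""
    return p / (2.0 * b * math.gamma(1.0 / p)) * math.exp(-((abs(x - mu) / b) ** p))


def gg_variance(b, p):
    """Variance of GG(mu, b, p): b^2 * Gamma(3/p) / Gamma(1/p)."""
    return b * b * math.gamma(3.0 / p) / math.gamma(1.0 / p)


b = 1.5
print(gg_variance(b, 1), 2 * b * b)   # Laplace case: variance 2 b^2
print(gg_variance(b, 2), b * b / 2)   # Gaussian case: variance b^2 / 2
```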

Figure 1: Density of GG distributions

Figure 1 presents some examples of the GG distributions at different $p$. All the distributions in the left plot have the same scale $b$ and location $\mu$, and those in the right plot have the same variance and location $\mu$. When the scale parameter is the same (the left plot), the distributions become less spread as $p$ increases, and the Laplace distribution ($p=1$) looks very different from the rest. When the variance is the same (the right plot), the Laplace distribution is the most likely to generate values that are close to the mean, followed by the Gaussian distribution ($p=2$).

2.4 GG mechanism of $\epsilon$-DP

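We first examine the GG mechanism of $\epsilon$-DP with the domain for $s_k^*$ defined on $(-\infty,\infty)$ for $k=1,\ldots,r$. $\mathbf{s}$ needs to be bounded to calculate the GS, but the bounding requirement does not necessarily go into formulating the GG distribution for the GG mechanism in the first place. If bounding for $\mathbf{s}^*$ is necessary, it can be incorporated in a post-hoc manner after $\mathbf{s}^*$ is generated from the GG mechanism. A well-known example is the Laplace mechanism. It employs a Laplace distribution defined on $(-\infty,\infty)$, though its scale parameter requires $\mathbf{s}$ to be bounded for $\Delta_1$ to be calculated.

Eqn (5) presents the GG distribution from which sanitized $\mathbf{s}^*$ would be generated to satisfy $\epsilon$-DP, assuming $b$ exists.

$$f(\mathbf{s}^*)=\prod_{k=1}^r\frac{p}{2b\,\Gamma(p^{-1})}\exp\left\{-\left(\frac{|s_k^*-s_k|}{b}\right)^p\right\}=\left(\frac{p}{2b\,\Gamma(p^{-1})}\right)^r\exp\left\{-\frac{\|\mathbf{s}^*-\mathbf{s}\|_p^p}{b^p}\right\} \tag{5}$$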
Claim 4.

There does not exist a lower bound on $b$ for the GG distribution in Eqn (5) when $p\ge2$ that generates $\mathbf{s}^*$ with $\epsilon$-DP. When $p=1$, the lower bound on $b$ that leads to $\epsilon$-DP is $\Delta_1\epsilon^{-1}$.

Appendix B lists the detailed steps that lead to Claim 4. In brief, to achieve $\epsilon$-DP, we need $b^{-p}\left(\Delta_p^p+\sum_{k=1}^r\sum_{j=1}^{p-1}\binom{p}{j}|e_k|^{p-j}\Delta_{1,k}^j\right)\le\epsilon$ (Eqn B.4). However, this inequality depends on the random GG noise $e_k=s_k^*-s_k$ for $k=1,\ldots,r$, the support of which is $(-\infty,\infty)$. In other words, there does not exist a solution on $b$ that is free of the random noise, unless $p=1$, in which case the inequality no longer involves the error terms and the GG mechanism reduces to the familiar Laplace mechanism of $\epsilon$-DP. We propose two approaches to fix the problem and achieve DP through the GG mechanism. The first approach leverages the bounding requirement for $\mathbf{s}$ and builds the requirement into the GG distribution in the first place to generate $\mathbf{s}^*$ with $\epsilon$-DP, assuming that $\mathbf{s}$ and $\mathbf{s}^*$ share the same bounded domain (Section 2.5). The second approach still uses the GG distribution in Eqn (5) to sanitize $\mathbf{s}$, only satisfying $(\epsilon,\delta)$-pDP instead of the pure $\epsilon$-DP (Section 2.6). The sanitized $\mathbf{s}^*$ can be bounded in a post-hoc manner, as needed.
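
To make the sampling step concrete, here is a minimal sketch of drawing GG noise and of the $p=1$ special case (the Laplace mechanism with scale $\Delta_1/\epsilon$). It relies on the fact that if $G\sim\text{Gamma}(1/p,1)$, then a random sign times $b\,G^{1/p}$ follows GG$(0,b,p)$. The function names and the toy query values are hypothetical.

```python
import random


def sample_gg(mu, b, p, rng):
    """Draw one value from GG(mu, b, p) via a Gamma(1/p, 1) variate."""
    g = rng.gammavariate(1.0 / p, 1.0)          # (|x - mu| / b)^p ~ Gamma(1/p, 1)
    sign = 1.0 if rng.random() < 0.5 else -1.0  # the GG density is symmetric
    return mu + sign * b * g ** (1.0 / p)


def laplace_mechanism(s, gs_l1, epsilon, rng):
    """GG mechanism of order 1: independent Laplace noise with b = Delta_1 / epsilon."""
    b = gs_l1 / epsilon
    return [sample_gg(s_k, b, 1, rng) for s_k in s]


rng = random.Random(0)
sanitized = laplace_mechanism([12.0, 7.0, 3.0], gs_l1=1.0, epsilon=0.5, rng=rng)
print(sanitized)
```

For $p\ge2$ the same sampler applies once a scale $b$ has been chosen; how to choose it is the subject of Sections 2.5 and 2.6.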

2.5 truncated GG mechanism and boundary inflated truncated GG mechanism of $\epsilon$-DP

Definition 5.

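Denote the bounds on query result $\mathbf{s}$ by $[c_{k0},c_{k1}]$ for $k=1,\ldots,r$. For integer $p\ge1$, the truncated GG mechanism of order $p$ generates $\mathbf{s}^*$ with $\epsilon$-DP by drawing from the truncated GG distribution

$$f(\mathbf{s}^*\mid c_{k0}\le s_k^*\le c_{k1},\ \forall k)=\prod_{k=1}^r\frac{p\exp\{-(|s_k^*-s_k|/b)^p\}}{2b\,\Gamma(p^{-1})\,A(s_k)}, \tag{6}$$

$$b\ge\left(2\epsilon^{-1}\left(\Delta_p^p+\sum_{k=1}^r\sum_{j=1}^{p-1}\binom{p}{j}|c_{k1}-c_{k0}|^{p-j}\Delta_{1,k}^j\right)\right)^{1/p}, \tag{7}$$

where $A(s_k)=\Pr(c_{k0}\le s_k^*\le c_{k1};s_k,b)=\frac{\gamma(p^{-1},((c_{k1}-s_k)/b)^p)+\gamma(p^{-1},((s_k-c_{k0})/b)^p)}{2\Gamma(p^{-1})}$ ($\gamma$ is the lower incomplete gamma function), $\Delta_p$ is the $l_p$ GS of $\mathbf{s}$, and $\Delta_{1,k}$ is the $l_1$ GS of $s_k$.

The proof of $\epsilon$-DP of the truncated GG mechanism is given in Appendix C. The truncated GG mechanism perturbs each element in $\mathbf{s}$ independently; thus Eqn (6) involves the product of $r$ independent density functions. Though the closed interval $[c_{k0},c_{k1}]$ is used to denote the bounds on $s_k$, Definition 5 remains the same regardless of whether the interval is closed, open, or half-closed since the GG distribution is defined on a continuous domain. If $\mathbf{s}$ is discrete in nature, such as counts, post-hoc rounding on perturbed $\mathbf{s}^*$ can be applied. The lower bound on $b$ in Eqn (7) depends on $\Delta_p$. We may apply Lemma 2 and set $\Delta_p$ at its upper bound to obtain a less tight bound on $b$:

$$b\ge\left(2\epsilon^{-1}\sum_{k=1}^r\sum_{j=1}^{p}\binom{p}{j}|c_{k1}-c_{k0}|^{p-j}\Delta_{1,k}^j\right)^{1/p}. \tag{8}$$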
Definition 6.

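Denote the bounds on query result $s_k$ by $[c_{k0},c_{k1}]$ for $k=1,\ldots,r$. For integer $p\ge1$, the order-$p$ boundary inflated truncated (BIT) GG mechanism sanitizes $\mathbf{s}$ with $\epsilon$-DP by drawing perturbed $\mathbf{s}^*$ from the following piecewise distribution

$$f(\mathbf{s}^*)=\prod_{k=1}^r q_{k0}^{\,I(s_k^*=c_{k0})}\;q_{k1}^{\,I(s_k^*=c_{k1})}\left(\frac{p}{2b\,\Gamma(p^{-1})}\exp\left\{-\left(\frac{|s_k^*-s_k|}{b}\right)^p\right\}\right)^{I(c_{k0}<s_k^*<c_{k1})} \tag{9}$$

where $q_{k0}=\frac12-\frac{\gamma(p^{-1},((s_k-c_{k0})/b)^p)}{2\Gamma(p^{-1})}$ and $q_{k1}=\frac12-\frac{\gamma(p^{-1},((c_{k1}-s_k)/b)^p)}{2\Gamma(p^{-1})}$, $\gamma$ is the lower incomplete gamma function, and $\Gamma$ is the gamma function; and $I(\cdot)$ is the indicator function that equals 1 if the argument in the parentheses is true, 0 otherwise.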

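In brief, the BIT GG distribution replaces out-of-bound values with the boundary values and keeps the within-bound values as is, leading to a piecewise distribution. This is in contrast to the truncated GG distribution, which throws away out-of-bound values. The challenge with perturbing $\mathbf{s}$ directly via Eqn (9) lies in solving for a lower bound on $b$ that satisfies $\epsilon$-DP from

$$e^{-\epsilon}\le\frac{\Pr(\mathbf{s}^*(\mathbf{x})\in Q)}{\Pr(\mathbf{s}^*(\mathbf{x}')\in Q)}\le e^{\epsilon} \tag{10}$$

where $\mathbf{s}^*(\mathbf{x})$ and $\mathbf{s}^*(\mathbf{x}')$ are the sanitized results from data $\mathbf{x}$ and $\mathbf{x}'$ with $d(\mathbf{x},\mathbf{x}')=1$, respectively. The lower bound given in Eqns (7) and (8) can be used when the output subset $Q$ is a subset of $(c_{10},c_{11})\times\cdots\times(c_{r0},c_{r1})$ (open intervals). However, when $Q$ is $\{c_{k0}\}$ and $\{c_{k1}\}$, respectively, there are no analytical solutions on $b$ in either Eqn (11) or Eqn (12):

$$e^{-\epsilon}\le\frac{\Pr(s_k^*(\mathbf{x})=c_{k0})}{\Pr(s_k^*(\mathbf{x}')=c_{k0})}\le e^{\epsilon} \tag{11}$$

$$e^{-\epsilon}\le\frac{\Pr(s_k^*(\mathbf{x})=c_{k1})}{\Pr(s_k^*(\mathbf{x}')=c_{k1})}\le e^{\epsilon} \tag{12}$$

The most challenging situation is when $Q$ is a mixture set of $(c_{k0},c_{k1})$, $\{c_{k0}\}$, and $\{c_{k1}\}$ for different $k$. In summary, the BIT GG mechanism is not very appealing from a practical standpoint.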

2.6 GG mechanism of $(\epsilon,\delta)$-pDP

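The second approach to obtain a lower bound on the scale parameter $b$ for the GG distribution in Eqn (5) when $p\ge2$ is to employ a soft version of DP. Corollary 7 presents a solution on $b$ that satisfies $(\epsilon,\delta)$-pDP.

Corollary 7.

If the scale parameter $b$ in the GG distribution in Eqn (5) satisfies

$$\Pr\left(\sum_{k=1}^r a_k>b^p\epsilon-\Delta_p^p\right)<\delta,\quad\text{where } a_k=\sum_{j=1}^{p-1}\binom{p}{j}|s_k^*-s_k|^{p-j}\Delta_{1,k}^j, \tag{13}$$

then the GG mechanism satisfies $(\epsilon,\delta)$-pDP when $p\ge2$.

The proof is straightforward. Specifically, rather than requiring the left side of Eqn (B.4) to be $\le\epsilon$ with certainty (i.e., with probability 100%), we attach a probability of $1-\delta$ of achieving the inequality, that is, $\Pr(\text{Eqn (B.4)})\ge1-\delta$, leading to Eqn (13). The $(\epsilon,\delta)$-pDP does not apply to the Laplace mechanism ($p=1$), at least in the framework laid out in Corollary 7. When $p=1$, Eqn (B.1) becomes $b^{-1}\Delta_1\le\epsilon$, which does not involve the random variable $\mathbf{s}^*$; in other words, as long as $b\ge\Delta_1\epsilon^{-1}$, the pure $\epsilon$-DP is guaranteed.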

Corollary 7 does not list a closed-form solution on $b$ as it is likely that only numerical solutions exist in most cases. Given that $s_k^*$ is independent across $k=1,\ldots,r$, $a_k$, a function of $s_k^*$, is also independent across $k$. Therefore, the problem becomes searching for a lower bound on $b$ where the probability of a sum of $r$ independent variables ($\sum_{k=1}^r a_k$) exceeding $b^p\epsilon-\Delta_p^p$ is smaller than $\delta$. If there exists a closed-form distribution function for $\sum_{k=1}^r a_k$, an exact solution on $b$ can be obtained. When $p=2$, an analytical lower bound can be obtained (see Section 3); when $p=3$ we only manage to obtain the distribution function for $a_k$, but not for $\sum_{k=1}^r a_k$ or $p>3$ at the current stage. A relatively simple case is when the elements of statistics $\mathbf{s}$ are calculated on disjoint subsets of the original data, so that removing one individual from the data only affects one element out of $r$, leading to Corollary 8.

Corollary 8.

When all elements in $\mathbf{s}$ are based on disjoint subsets of the data, the lower bound on $b$ satisfies $\Pr\left(a_k>b^p\epsilon-\Delta_{1,k}^p\right)<\delta$, where $a_k=\sum_{j=1}^{p-1}\binom{p}{j}|s_k^*-s_k|^{p-j}\Delta_{1,k}^j$.

When the query is a histogram, $\Delta_{1,k}=1$, and the lower bound for $(\epsilon,\delta)$-pDP can be derived from $\Pr\left(\sum_{j=1}^{p-1}\binom{p}{j}|s_k^*-s_k|^{p-j}>b^p\epsilon-1\right)<\delta$. The proof of Corollary 8 is trivial. With disjoint queries, only one element in $\mathbf{s}$ is affected by changing from $\mathbf{x}$ to $\mathbf{x}'$ while the other elements in Eqn (B.2) in Appendix B are 0 as $s_k(\mathbf{x})=s_k(\mathbf{x}')$, and Eqn (B.2) reduces to the single affected term.

Numerical approaches can be applied to obtain a lower bound on $b$ when the closed-form solutions are difficult to attain. Figure 2 depicts the lower bounds on $b$ at different $\epsilon$ and $\delta$ obtained via the Monte Carlo approach. We fixed the element-wise sensitivities $\Delta_{1,k}$ for $k=1,\ldots,r$ and applied Lemma 2 to obtain an upper bound on $\Delta_p$ for a given $p$ value. As expected, the lower bound on $b$ increases with decreased $\epsilon$ (lower privacy budget) and decreased $\delta$ (reduced chance of failing the pure $\epsilon$-DP). The results also suggest $b$ increases with $p$ to maintain $(\epsilon,\delta)$-pDP in the examined scenarios.
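
A Monte Carlo search of this kind might look like the sketch below. It is not the code used for Figure 2. It assumes the histogram case with element-wise sensitivity 1, and it assumes the pDP condition takes the form $\Pr\left(\sum_{j=1}^{p-1}\binom{p}{j}|e|^{p-j}>b^p\epsilon-1\right)<\delta$ with $e\sim\text{GG}(0,b,p)$. The grid, the number of draws, and all function names are hypothetical.

```python
import math
import random


def sample_abs_gg(b, p, rng):
    """|e| for e ~ GG(0, b, p), using (|e| / b)^p ~ Gamma(1/p, 1)."""
    return b * rng.gammavariate(1.0 / p, 1.0) ** (1.0 / p)


def failure_prob(b, p, epsilon, n_draws, rng):
    """Monte Carlo estimate of Pr(a > b^p * epsilon - 1) under the assumed condition."""
    threshold = b ** p * epsilon - 1.0
    exceed = 0
    for _ in range(n_draws):
        e = sample_abs_gg(b, p, rng)
        a = sum(math.comb(p, j) * e ** (p - j) for j in range(1, p))
        if a > threshold:
            exceed += 1
    return exceed / n_draws


def mc_lower_bound(p, epsilon, delta, b_grid, n_draws=20000, seed=1):
    """Smallest grid value of b whose estimated failure probability is below delta."""
    rng = random.Random(seed)
    for b in b_grid:
        if failure_prob(b, p, epsilon, n_draws, rng) < delta:
            return b
    return None  # grid exhausted without meeting the target


grid = [0.5 + 0.1 * i for i in range(100)]
print(mc_lower_bound(p=2, epsilon=1.0, delta=0.05, b_grid=grid))
print(mc_lower_bound(p=3, epsilon=1.0, delta=0.05, b_grid=grid))
```

The returned value is only as fine as the grid and carries Monte Carlo error, so in practice one would refine the grid near the answer and increase the number of draws.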

Figure 2: Numerical lower bound on $b$ from Corollary 7

$\mathbf{s}^*$ sampled from the GG mechanism of $(\epsilon,\delta)$-pDP in Eqn (5), once $b$ is determined analytically or numerically, ranges over $(-\infty,\infty)$. To bound $\mathbf{s}^*$, it is straightforward to apply a post-processing procedure such as the truncation and the boundary inflated truncation (BIT) procedure [17]. The truncation procedure throws away the out-of-bounds values and only keeps those in bounds, while the BIT procedure sets the out-of-bounds values at the bounds. If the bounds are noninformative in the sense that the bounds are global and do not contain any data-specific information, then neither one of the two post-hoc bounding procedures will leak the original information or compromise the established $(\epsilon,\delta)$-pDP.
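
The two post-hoc bounding procedures can be sketched as follows. This is an illustration only: Gaussian noise stands in for the GG draw, the bounds and scale are toy values, and the helper names are hypothetical.

```python
import random


def bit_bound(value, lower, upper):
    """BIT procedure: set out-of-bounds values at the nearest bound."""
    return min(max(value, lower), upper)


def truncate_by_redraw(draw, lower, upper, max_tries=100000):
    """Truncation procedure: discard out-of-bounds draws and keep the first in-bounds one."""
    for _ in range(max_tries):
        value = draw()
        if lower <= value <= upper:
            return value
    raise RuntimeError("no in-bounds draw; check the bounds and the noise scale")


rng = random.Random(42)
original, sigma, lower, upper = 3.0, 5.0, 0.0, 10.0   # toy values


def draw():
    return rng.gauss(original, sigma)


bit_values = [bit_bound(draw(), lower, upper) for _ in range(1000)]
trunc_values = [truncate_by_redraw(draw, lower, upper) for _ in range(1000)]
print(sum(v == lower for v in bit_values), sum(v == upper for v in bit_values))
```

BIT piles probability mass on the two bounds, whereas truncation redistributes it over the interior, which is why the two procedures give differently shaped sanitized outputs.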

2.7 Connection between GG mechanism and Exponential Mechanism

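The exponential mechanism was introduced by McSherry and Talwar [4]. We paraphrase the original definition as follows, covering both discrete and continuous outcomes. Let $\mathcal{S}$ denote the set containing all possible outputs $\mathbf{s}^*$. The exponential mechanism releases $\mathbf{s}^*$ with probability

$$\Pr(\mathbf{s}^*)=\frac{\exp\left(\epsilon\,u(\mathbf{s}^*\mid\mathbf{x})/(2\Delta_u)\right)}{A(\mathbf{x})} \tag{14}$$

to ensure $\epsilon$-DP. $A(\mathbf{x})$ is a normalizing constant so that $\Pr(\mathbf{s}^*)$ sums or integrates to 1, and equals $\sum_{\mathbf{s}^*\in\mathcal{S}}\exp\left(\epsilon\,u(\mathbf{s}^*\mid\mathbf{x})/(2\Delta_u)\right)$ or $\int_{\mathcal{S}}\exp\left(\epsilon\,u(\mathbf{s}^*\mid\mathbf{x})/(2\Delta_u)\right)d\mathbf{s}^*$, depending on whether $\mathcal{S}$ is a countable/discrete sample space or a continuous set, respectively. $u$ is the utility function and assigns a utility score to each possible outcome $\mathbf{s}^*$ conditional on the original data $\mathbf{x}$, and $\Delta_u=\max|u(\mathbf{s}^*\mid\mathbf{x})-u(\mathbf{s}^*\mid\mathbf{x}')|$ is the maximum change in the utility score across all possible outputs $\mathbf{s}^*$ and all possible data sets $\mathbf{x}$ and $\mathbf{x}'$ with $d(\mathbf{x},\mathbf{x}')=1$. From a practical perspective, the scores should properly reflect the usefulness of $\mathbf{s}^*$. For example, usefulness can be measured by the similarity between perturbed $\mathbf{s}^*$ and original $\mathbf{s}$ if $\mathbf{s}$ is numerical. The closer $\mathbf{s}^*$ is to the original $\mathbf{s}$, the larger $u$ is, and the higher the probability that $\mathbf{s}^*$ will be released. The Exponential mechanism can be conservative (see Appendix D), in the sense that the actual privacy cost is lower than the nominal privacy budget $\epsilon$, or that more perturbation than necessary is injected to preserve $\epsilon$-DP. Despite the conservativeness, the Exponential mechanism is widely used in DP owing to its generality and flexibility, as long as the utility function is properly designed.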
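
For a discrete output set, the release probabilities can be computed directly. The sketch below assumes the standard form $\Pr(s^*)\propto\exp(\epsilon\,u(s^*\mid\mathbf{x})/(2\Delta_u))$ and uses a toy utility, the negative absolute distance to a true count of 4, whose sensitivity is 1 when one record is added or removed. The candidate set and the names are hypothetical.

```python
import math
import random


def exponential_mechanism_probs(candidates, utility, sensitivity, epsilon):
    """Release probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scores = [epsilon * utility(c) / (2.0 * sensitivity) for c in candidates]
    top = max(scores)                               # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)                            # plays the role of the normalizing constant
    return [w / total for w in weights]


true_count = 4
candidates = list(range(11))
probs = exponential_mechanism_probs(
    candidates, lambda c: -abs(c - true_count), sensitivity=1.0, epsilon=1.0
)
released = random.Random(0).choices(candidates, weights=probs, k=1)[0]
print(released, [round(q, 3) for q in probs])
```

Candidates closer to the true count receive larger probabilities, and adjacent candidates differ by a factor of $e^{\epsilon/2}$ in this toy setup.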

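When $u$ is defined as the negative $p$th power of the $p$th-order Minkowski distance between $\mathbf{s}^*$ and $\mathbf{s}$, that is, $u(\mathbf{s}^*\mid\mathbf{x})=-\|\mathbf{s}^*-\mathbf{s}\|_p^p$, the Exponential mechanism generates perturbed $\mathbf{s}^*$ from the GG distribution

$$f(\mathbf{s}^*)=\frac{\exp\left\{-\|\mathbf{s}^*-\mathbf{s}\|_p^p\,\epsilon/(2\Delta_u)\right\}}{A(\mathbf{x})}=\prod_{k=1}^r\frac{p}{2b\,\Gamma(p^{-1})}\exp\left\{-\left(\frac{|s_k^*-s_k|}{b}\right)^p\right\} \tag{15}$$

with $b^p=2\Delta_u\epsilon^{-1}$ and $A(\mathbf{x})=\left(2b\,\Gamma(p^{-1})/p\right)^r$. The scale parameter $b$ in Eqn (15) is a function of the GS of the utility function and the privacy budget $\epsilon$. For bounded data $s_k\in[c_{k0},c_{k1}]$ for $k=1,\ldots,r$, the Exponential mechanism based on the GG distribution is

$$f(\mathbf{s}^*\mid c_{k0}\le s_k^*\le c_{k1},\ \forall k)=\prod_{k=1}^r\frac{p\exp\{-(|s_k^*-s_k|/b)^p\}}{2b\,\Gamma(p^{-1})\,A(s_k)} \tag{16}$$

where $A(s_k)=\Pr(c_{k0}\le s_k^*\le c_{k1};s_k,b)$ is calculated from the pdf of GG$(s_k,b,p)$. Compared to the truncated GG mechanism in Definition 5, the only difference in the Exponential mechanism in Eqn (16) is how the scale parameter $b$ is defined. In Definition 5, $b$ depends on the GS of $\mathbf{s}$ ($\Delta_p$), while it is a function of the GS of the utility function ($\Delta_u$) in the Exponential mechanism. Specifically, $b^p=2\Delta_u\epsilon^{-1}$ in the Exponential mechanism, and the lower bound on $b$ is given in Eqn (7) in the GG mechanism. While both mechanisms will lead to the satisfaction of $\epsilon$-DP, the one with a smaller $b$ is preferable at the same $\epsilon$. The magnitude of $b$ in each case depends on the bounds of $\mathbf{s}$ and the order $p$, in addition to $\Delta_p$ or $\Delta_u$. Though not a direct comparison on $b$, Lemma 9 explores the relationship between $\Delta_u$ and $\Delta_p$, with the hope to shed light on the comparison of $b$ (the proof is in Appendix E).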

Lemma 9.

Let $[c_{k0},c_{k1}]$ denote the bounds on $s_k$ for $k=1,\ldots,r$.

  1. When $p=1$, $\Delta_u=\Delta_1$. Both the GG mechanism and the GG-distribution based Exponential mechanism reduce to the truncated Laplace mechanism with the same $b$.

  2. When , .

  3. When for , , where is GS of .

As a final note on the GG-distribution based Exponential mechanism, we did not use the negative Minkowski distance $-\|\mathbf{s}^*-\mathbf{s}\|_p$ directly as the utility function $u$ due to a couple of potential practical difficulties with this approach. First, $\Delta_u$ can be difficult to obtain. Second, the resulting distribution, $f(\mathbf{s}^*)\propto\exp\{-\|\mathbf{s}^*-\mathbf{s}\|_p\,\epsilon/(2\Delta_u)\}$, does not appear to be associated with any known distributions (except when $p=1$), and additional efforts are required to study the properties of $f(\mathbf{s}^*)$ and to develop an efficient algorithm to draw samples from it.

3 Gaussian Mechanism

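A special case of the GG mechanism is the Gaussian mechanism when $p=2$, which draws $s_k^*$ independently from a Gaussian distribution with mean $s_k$ and variance $\sigma^2=b^2/2$ for $k=1,\ldots,r$. Applying Eqn (6) with $b$ defined in Eqns (7) and (8), we can obtain the truncated Gaussian mechanism of $\epsilon$-DP for bounded $\mathbf{s}$,

$$f(\mathbf{s}^*\mid c_{k0}\le s_k^*\le c_{k1},\ \forall k)=\prod_{k=1}^r\frac{\sigma^{-1}\phi\left((s_k^*-s_k)/\sigma\right)}{\Phi\left((c_{k1}-s_k)/\sigma\right)-\Phi\left((c_{k0}-s_k)/\sigma\right)} \tag{17}$$

where $\phi$ and $\Phi$ are the pdf and the CDF of the standard Gaussian distribution, respectively.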

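An analytical solution on the lower bound of $b$ for the Gaussian mechanism of $(\epsilon,\delta)$-pDP is provided in Lemma 10 (the proof is provided in Appendix F).

Lemma 10.

The lower bound on the scale parameter $b$ from the Gaussian mechanism of $(\epsilon,\delta)$-pDP is $b\ge2^{-1/2}\epsilon^{-1}\Delta_2\left(\sqrt{(\Phi^{-1}(\delta/2))^2+2\epsilon}-\Phi^{-1}(\delta/2)\right)$.

Given the relationship between $b$ and the standard deviation $\sigma$ of the Gaussian distribution, $\sigma=b/\sqrt{2}$, the lower bound can also be expressed in $\sigma$,

$$\sigma\ge(2\epsilon)^{-1}\Delta_2\left(\sqrt{(\Phi^{-1}(\delta/2))^2+2\epsilon}-\Phi^{-1}(\delta/2)\right). \tag{18}$$

The pDP lower bound given in Eqn (18) is different from the lower bound

$$\sigma\ge\epsilon^{-1}\Delta_2\sqrt{2\ln(1.25/\delta)} \tag{19}$$

in Dwork and Roth [10] for $(\epsilon,\delta)$-aDP (Eqn (2)). The pDP bound in Eqn (18) is tighter than the aDP bound in Eqn (19) for the same set of $(\epsilon,\delta)$ (note that the interpretation of $\delta$ in pDP and aDP is different, but the DP guarantee is roughly the same when $\delta$ is small). In addition, the pDP bound does not constrain $\epsilon$ to be $<1$ as required in the aDP bound. Figure 3 compares the two lower bounds at several $\epsilon$ and $\delta$. As observed, the ratio of the pDP lower bound to the aDP lower bound is always $<1$ for the same $(\epsilon,\delta)$. The smaller $\epsilon$ is, or the larger $\delta$ is, the smaller the ratio is and the larger the difference is between the two bounds.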
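
The comparison can be reproduced numerically. The sketch below assumes the pDP bound has the form $\sigma\ge(2\epsilon)^{-1}\Delta_2\left(\sqrt{(\Phi^{-1}(\delta/2))^2+2\epsilon}-\Phi^{-1}(\delta/2)\right)$ and uses the aDP bound $\sigma\ge\epsilon^{-1}\Delta_2\sqrt{2\ln(1.25/\delta)}$ for $\epsilon<1$ from Dwork and Roth [10]. The function names and the grid of $(\epsilon,\delta)$ values are illustrative.

```python
import math
from statistics import NormalDist


def sigma_pdp(eps, delta, gs_l2=1.0):
    """Assumed pDP lower bound on sigma for the Gaussian mechanism."""
    z = NormalDist().inv_cdf(delta / 2.0)          # negative for delta < 1
    return gs_l2 * (math.sqrt(z * z + 2.0 * eps) - z) / (2.0 * eps)


def sigma_adp(eps, delta, gs_l2=1.0):
    """aDP lower bound on sigma; only stated for 0 < eps < 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("the aDP bound requires 0 < eps < 1")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * gs_l2 / eps


for eps in (0.1, 0.5, 0.9):
    for delta in (0.01, 0.05, 0.25):
        pdp, adp = sigma_pdp(eps, delta), sigma_adp(eps, delta)
        print(f"eps={eps} delta={delta} pDP={pdp:.3f} aDP={adp:.3f} ratio={pdp / adp:.3f}")
```

On this grid the printed ratio stays below 1 and shrinks as $\epsilon$ decreases or $\delta$ increases, in line with the pattern described for Figure 3.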

Figure 3: Comparison of pDP lower bound (Eqn 18) vs. aDP bound (Eqn 19) on $\sigma$ in the Gaussian mechanism for $\epsilon<1$ (the aDP bound requires $\epsilon<1$)

Dwork and Roth [10] list several advantages of Gaussian noise: it is a familiar type of noise, as many noise sources in real life can be well approximated by Gaussian distributions; the sum of Gaussian variables is still Gaussian; and finally, in the case of multiple queries or when $\delta$ is small, the pure-DP guarantee in the Laplace mechanism and the pDP guarantee in the Gaussian mechanism see minimal difference. A theoretical disadvantage of Gaussian noise is that it does not guarantee DP in some cases (e.g., Report Noisy Max) [10].

We investigate the accuracy of $\mathbf{s}^*$ by examining the tail probability and the dispersion of the noises injected via the $\epsilon$-DP Laplace mechanism and the $(\epsilon,\delta)$-pDP Gaussian mechanism. Denote the noise drawn from the Laplace distribution by $e_L$ and that from the Gaussian distribution by $e_G$. The location parameters of both are 0; the tail probability is $\Pr(|e_L|>t)=\exp(-t\epsilon/\Delta_1)$ in the Laplace distribution and $\Pr(|e_G|>t)=2\Phi(-t/\sigma)$ in the Gaussian distribution, where $\sigma$ is given in Eqn (18). Since the CDF $\Phi$ does not have a closed-form expression, we examine several numerical examples to compare the two tail probabilities (Figure 4). We set $\epsilon$ to be the same (0.1, 1, 2, respectively) between the two mechanisms and examine several values of $\delta$ for the $(\epsilon,\delta)$-pDP Gaussian mechanism. If the ratio $\Pr(|e_L|>t)/\Pr(|e_G|>t)$ is $<1$, it implies that the Laplace mechanism is less likely to generate more extreme $\mathbf{s}^*$ compared to the Gaussian mechanism at the same privacy specification of $\epsilon$. We should focus on the meaningful cases where the noise at least has a non-ignorable chance to occur in either mechanism. We applied a cutoff to the tail probabilities and kept the cases where at least one of the two exceeds it (other cutoffs can be used, depending on how "unlikely" is defined). It is interesting to observe that after the initial take-off at 1 when $t=0$, the ratio decreases until it hits the bottom and then bounces back, with some cases eventually exceeding 1 at some value of $t$, depending on the privacy parameter specification. The smaller $\epsilon$ or $\delta$ is, the longer it takes for the bounce-back to occur. The observation suggests that the Laplace mechanism is in some cases more likely to generate sanitized results that are far away from $\mathbf{s}$.
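
The shape of the tail-probability ratio can be explored with the short sketch below. It assumes Laplace noise with scale $\Delta/\epsilon$, Gaussian noise with $\sigma$ set at the pDP bound in the form $(2\epsilon)^{-1}\Delta\left(\sqrt{(\Phi^{-1}(\delta/2))^2+2\epsilon}-\Phi^{-1}(\delta/2)\right)$, and a scalar query so that the $l_1$ and $l_2$ sensitivities coincide. The names and the chosen values of $t$ are hypothetical.

```python
import math
from statistics import NormalDist


def sigma_pdp(eps, delta, gs=1.0):
    """Assumed pDP lower bound on sigma for the Gaussian mechanism."""
    z = NormalDist().inv_cdf(delta / 2.0)
    return gs * (math.sqrt(z * z + 2.0 * eps) - z) / (2.0 * eps)


def laplace_tail(t, eps, gs=1.0):
    """Pr(|e| > t) for Laplace noise with scale gs / eps."""
    return math.exp(-t * eps / gs)


def gaussian_tail(t, sigma):
    """Pr(|e| > t) for N(0, sigma^2) noise; erfc stays accurate far in the tail."""
    return math.erfc(t / (sigma * math.sqrt(2.0)))


def tail_ratio(t, eps, delta, gs=1.0):
    """Laplace tail probability divided by Gaussian tail probability."""
    return laplace_tail(t, eps, gs) / gaussian_tail(t, sigma_pdp(eps, delta, gs))


for t in (0.0, 1.0, 2.0, 5.0, 10.0, 20.0, 30.0):
    print(t, tail_ratio(t, eps=1.0, delta=0.05))
```

With these toy settings the ratio starts at 1, drops below 1 for moderate $t$, and climbs above 1 only far in the tail, where both probabilities are already tiny.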

Figure 4: Ratio on the tail probabilities (the gray curves represent the unlikely cases where both and are )

We also compare the privacy parameter $\epsilon$ between the two mechanisms when both have the same tail probability. Figure 5 shows the calculated $\epsilon'$ value associated with the Gaussian mechanism of $(\epsilon',\delta)$-pDP for a given $\delta$ that yields the same tail probability as the Laplace mechanism of $\epsilon$-DP. If the ratio $\epsilon'/\epsilon<1$ at some $t$ and a small and somewhat ignorable $\delta$, it implies the same tail probability can be achieved with less privacy cost with the Gaussian mechanism compared to the Laplace mechanism. Figure 5 suggests that at the same $t$, the more relaxation of the pure $\epsilon$-DP is allowed (i.e., the larger $\delta$ is), the smaller $\epsilon'$ is (relative to baseline $\epsilon$), which is expected as $\epsilon'$ and $\delta$ together determine the noise released in the Gaussian mechanism.

Figure 5: Relative privacy cost (the gray curves represent the unlikely cases where both and are )

Lemma 11 presents the precision comparison of $s^*$ between the Laplace mechanism of $\epsilon$-DP and the Gaussian mechanism of $(\epsilon,\delta)$-pDP. With the same location parameter in the Laplace and Gaussian distributions, a larger precision is equivalent to a smaller mean squared error (MSE).

Lemma 11.

Between the Gaussian mechanism of $(\epsilon,\delta)$-pDP and the Laplace mechanism of $\epsilon$-DP for sanitizing a statistic $s$, when $\delta<2\Phi(-\sqrt{2})\approx0.157$, the variance of the Gaussian distribution in the Gaussian mechanism is always greater than the variance of the Laplace distribution associated with the Laplace mechanism.

The proof is provided in Appendix G. Lemma 11 suggests that there is more dispersion in the perturbed $s^*$ released by the Gaussian mechanism of $(\epsilon,\delta)$-pDP than by the Laplace mechanism of $\epsilon$-DP. In other words, if there are multiple sets of $s^*$ released via the Gaussian and the Laplace mechanisms respectively, then the former sets would have a wider spread than the latter. Since $(\epsilon,\delta)$-pDP provides less privacy protection than $\epsilon$-DP, together with the larger MSE, it can be argued that the Laplace mechanism is superior to the Gaussian mechanism (which is also reflected in the 3 experiments in Section 4). It should be noted that $\delta<0.157$ in Lemma 11 is a sufficient but not necessary condition. In other words, the Gaussian mechanism may not be less dispersed than the Laplace mechanism when $\delta\ge0.157$. Furthermore, since $\delta$ needs to be small to provide sufficient privacy protection in the setting of $(\epsilon,\delta)$-pDP, it is very unlikely to have $\delta>0.157$ in practical applications. Also note that the setting explored in Lemma 11, where the focus is on examining the precision (dispersion) of a single perturbed statistic $s^*$ given the specified privacy parameters and the original statistic $s$ when the sample size $n$ of a data set is public, is different from the recent work on the bounds of sample complexity (required sample size) to reach a certain level of statistical accuracy in perturbed results with $\epsilon$-DP or $(\epsilon,\delta)$-aDP [18] (more discussions are provided in Section 5 on this point).
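
The variance comparison can be spot-checked for a scalar statistic. The sketch below assumes the Gaussian $\sigma$ is set at the pDP bound in the form $(2\epsilon)^{-1}\Delta\left(\sqrt{(\Phi^{-1}(\delta/2))^2+2\epsilon}-\Phi^{-1}(\delta/2)\right)$ and that the Laplace scale is $\Delta/\epsilon$, so the Laplace variance is $2(\Delta/\epsilon)^2$. The threshold $2\Phi(-\sqrt{2})$ is computed as the value of $\delta$ at which $-\Phi^{-1}(\delta/2)=\sqrt{2}$. All names are illustrative.

```python
import math
from statistics import NormalDist


def gaussian_variance(eps, delta, gs=1.0):
    """Variance of the Gaussian noise at the assumed pDP lower bound on sigma."""
    z = NormalDist().inv_cdf(delta / 2.0)
    sigma = gs * (math.sqrt(z * z + 2.0 * eps) - z) / (2.0 * eps)
    return sigma * sigma


def laplace_variance(eps, gs=1.0):
    """Variance of Laplace noise with scale gs / eps."""
    return 2.0 * (gs / eps) ** 2


threshold = 2.0 * NormalDist().cdf(-math.sqrt(2.0))
print(round(threshold, 4))
for eps in (0.1, 0.5, 1.0, 2.0):
    for delta in (0.01, 0.05, 0.1, 0.15):
        print(eps, delta, gaussian_variance(eps, delta) > laplace_variance(eps))
```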

4 Experiments

We run three experiments on the mildew data set, the Czech data set, and the Census Income data set (a.k.a. the adult data). The mildew data contains information on parental alleles at 6 loci on the chromosome for 70 strands of barley powder mildew [19]. Each locus has two levels, yielding a very sparse 6-way cross-tabulation (22 cells out of the 64 are non-empty with low frequencies in many other cells). The Czech data contains data collected on 6 potential risk factors for coronary thrombosis for 1841 workers in a Czechoslovakian car factory [19]. Each risk factor has 2 levels (Y or N). The cross-tabulation is also 6-way with 64 cells, the same as the mildew data, but the table is not as sparse given the larger $n$ (only one empty cell). The adult data was extracted from the 1994 US Census database to yield a set of reasonably clean records that satisfy a set of conditions [20]. The data set is often used to test classifiers by predicting whether a person makes over 50K a year. We used only the completers in the adult data (with no missing values on the attributes) and then split them into 2/3 training (20009 subjects) and 1/3 testing (10005 subjects).

Figure 6: sanitized vs. original cell counts in the mildew data
Figure 7: $l_1$ distance and KL divergence between sanitized and original counts in the mildew data
Figure 8: sanitized vs. original cell counts in the Czech data
Figure 9: $l_1$ distance and KL divergence between sanitized and original counts in the Czech data

In each experiment, we run the Laplace mechanism of -DP, the Gaussian mechanism of -pDP presented in Section 3, and the Gaussian mechanism of of -aDP [10] to sanitize count data. We examined and . To examine the variation of noises, we run 500 repeats and computed the means and standard deviations of distances between the sanitized and the original counts and the Kullback-Leibler (KL) divergence between the empirical distributions of the synthetic data and the original data over the 500 repeats. In addition, we tested the GG mechanism of order 3 () in the mildew data, and compared the classification accuracy of the income outcome in the testing data set in the adult experiment based on the support vector machines (SVMs) trained with the original training data and the sanitized training data, respectively. The KL distance was calculated using the KL.Dirichlet command in R package entropy that computes a Bayesian estimate of the KL divergence. The SVMs were trained using the svm command in R package e1071. In all experiments, for all since the released query is a histogram and the bin counts are based on disjoint subsets of data. The scale parameters of the Laplace mechanism and the Gaussian mechanisms were obtained analytically (, Eqns (18) and (19), respectively), the grid search and the MC approach were applied to obtain the lower bound for GGM-3 via Corollary 8. In the mildew and Czech experiments, we sanitized all bins in the histograms, including the empty bins, assuming all combinations of the 6 attributes in each case are practically meaningful (in other words, the empty cells are sample zeros rather than population zeros). In the adult data, there are 14 attributes and bins in the 14-attribute histogram, a non-ignorable portion of which do not make any practical sense (e.g., a 90-age works hours per week). For simplicity, we only sanitized the 17,985 nonempty cells in the training data. 
After the sanitization, we set the out-of-bounds synthetic counts below 0 at 0 and those above $n$ at $n$, and normalized the sanitized counts to sum to the original sample size $n$ in all 3 experiments, assuming $n$ itself is public or does not carry privacy information.
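The sanitization and post-processing steps described above can be sketched as follows. This is a minimal illustration in Python, not the R code used in the experiments. It assumes a sensitivity of 1 for the histogram query and the usual Laplace scale of sensitivity divided by $\epsilon$; the function names, the choice of an $l_1$ distance, and the pseudo-count of 0.5 in the smoothed KL divergence are all assumptions made for the sketch.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sanitize_histogram(counts, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise to every bin, clip to [0, n], then rescale to sum to n.

    sensitivity=1.0 is an assumption for a histogram over disjoint bins.
    """
    rng = random.Random(seed)
    n = sum(counts)
    scale = sensitivity / epsilon  # assumed Laplace scale for epsilon-DP
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    clipped = [min(max(x, 0.0), float(n)) for x in noisy]
    total = sum(clipped)
    if total == 0:
        # Degenerate case (every bin clipped to 0): spread n evenly (assumption).
        return [n / len(counts)] * len(counts)
    return [x * n / total for x in clipped]


def l1_distance(a, b):
    """Sum of absolute differences between two count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def smoothed_kl(p_counts, q_counts, pseudo=0.5):
    """KL(p || q) between bin proportions after adding a pseudo-count per bin."""
    ps = [c + pseudo for c in p_counts]
    qs = [c + pseudo for c in q_counts]
    sp, sq = sum(ps), sum(qs)
    return sum((p / sp) * math.log((p / sp) / (q / sq)) for p, q in zip(ps, qs))
```

Calling `sanitize_histogram` repeatedly with different seeds and collecting `l1_distance` and `smoothed_kl` against the original counts gives the kind of per-repeat summaries (means and standard deviations) reported in the figures. Replacing `laplace_noise` with a Gaussian draw at a scale supplied by the analyst would give the corresponding sketch for the Gaussian mechanisms.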

Figure 10: sanitized vs. original cell counts in the adult data
Figure 11: distance and KL divergence between sanitized and original counts in the adult data
Figure 12: Prediction accuracy in testing data via SVMs trained on sanitized and original data in the adult data

The results are given in Figures 6 to 12. In Figures 6, 8 and 10, the closer the points are to the identity line, the more similar the original and sanitized counts are. The Laplace sanitizer is the obvious winner in all 3 cases, producing the sanitized counts closest to the original with the smallest error and KL divergence, followed by the Gaussian mechanism of $(\epsilon,\delta)$-pDP, and GGM-3 of $(\epsilon,\delta)$-pDP in the mildew data; the Gaussian mechanism of $(\epsilon,\delta)$-aDP is the worst. In the mildew experiment, the performance of the Gaussian mechanism of $(\epsilon,\delta)$-pDP is similar when or . The error and the KL divergence decrease more or less linearly as $\epsilon$ increases from 0.5 to 1 to 2, while $\delta$ seemed to have a less profound impact on the error and the KL divergence. In the Czech experiment, the sanitized counts approach the original counts more quickly than in the mildew case with increased $\epsilon$ and $\delta$, but there is significantly more variability for small (0.1); and the error and the KL divergence no longer decrease in a linear fashion, but drastically from to 1 and much less from to 2. The differences in the results between the mildew and the Czech experiments can be explained by the larger in the latter. In the adult experiment, Figure 12 suggests that the prediction accuracy of the SVMs built on sanitized data is barely affected compared to the original accuracy, regardless of the mechanism. There are some decreases in the accuracy rates from the original, but they are largely ignorable (on the scale of 0.25% to 1%), even with the variation taken into account. In addition, the Gaussian mechanism of $(\epsilon,\delta)$-aDP, though the worst in preserving the original counts as measured by the distance and KL divergence, is no worse than the other two mechanisms in prediction.

5 Discussion

We introduced a new concept of the GS, and unified the Laplace mechanism and the Gaussian mechanism in the family of the GG mechanism. For bounded data, we discussed the truncated and the BIT GG mechanisms to achieve $\epsilon$-DP. We also proposed $(\epsilon,\delta)$-pDP as an alternative paradigm to the pure $\epsilon$-DP for the GG mechanism for order . We showed the connections and distinctions between the GG mechanism and the Exponential mechanism when the utility function is defined as the negative -power of the Minkowski distance between the original and sanitized results. We also presented the Gaussian mechanism as an example of the GG mechanism and derived a lower bound for the scale parameter of the associated Gaussian distribution to achieve $(\epsilon,\delta)$-pDP. The bound is tighter than the lower bound for the Gaussian mechanism of $(\epsilon,\delta)$-aDP. We compared the tail probability and the dispersion of the noise generated via the Gaussian mechanism of $(\epsilon,\delta)$-pDP and the Laplace mechanism. Finally, we applied the Gaussian mechanisms of $(\epsilon,\delta)$-pDP and $(\epsilon,\delta)$-aDP and the Laplace mechanism of $\epsilon$-DP to three real-life data sets.

The GG mechanism is based on the global sensitivity of query results in the sense that the sensitivity is independent of any specific data. Though the use of the GS is robust in terms of privacy protection, it could result in a large amount of noise being injected into query results. There is work that allows the sensitivity of a query to vary with data (local sensitivity) [21, 22] with the aim of increasing the accuracy of sanitized results. How to develop the GG mechanism in the context of local sensitivity is a topic for future investigation.

The setting for the examination of the tail probability and dispersion in Section 3 is different from, though related to, the work on upper and lower bounds on sample complexity, that is, the required sample size to reach a certain level of accuracy and privacy guarantee for count queries [23, 24, 18]. often refers to the accuracy of perturbed results in the DP literature, such as the worst-case accuracy or average accuracy, and might also refer to the tail probability and the MSE of released data, among others. A differential privacy mechanism is characterized by $\epsilon$ (and $\delta$) for privacy guarantee, to measure information preservation and utility of sanitized results, and the sample size of original data. The existing work on sample complexity focuses on bounding given $\epsilon$ (and $\delta$) and , while the results in Section 3 focus on the accuracy and precision of sanitized results given $\epsilon$ (and $\delta$) and . If the bias of the perturbed results (relative to the original results) is the same between the two mechanisms, a larger precision is equivalent to a smaller MSE.
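The last statement follows from the standard decomposition of the MSE into variance and squared bias. Writing $s$ for an original result and $s^*$ for its sanitized version (notation introduced here only for illustration):

\[
\mathrm{MSE}(s^*) \;=\; \mathrm{E}\big[(s^*-s)^2\big] \;=\; \mathrm{Var}(s^*) + \big(\mathrm{E}[s^*]-s\big)^2 .
\]

When the bias term $\mathrm{E}[s^*]-s$ is the same for two mechanisms, the one with the smaller variance, that is, the larger precision, has the smaller MSE.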

Appendix

Appendix A Proof of Lemma 2