Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanisms with Noninformative Bounding
Protection of individual privacy is a common concern when releasing and sharing data and information. Differential privacy (DP) formalizes privacy in probabilistic terms without making assumptions about the background knowledge of data intruders, and thus provides a robust concept for privacy protection. Practical applications of DP involve development of differentially private mechanisms to generate sanitized results at a pre-specified privacy budget. In the sanitization of bounded statistics such as proportions and correlation coefficients, the bounding constraints need to be incorporated in the differentially private mechanisms. There has been little work examining the consequences of incorporating bounding constraints on the accuracy of sanitized results and on the statistical inferences based on the sanitized results from a differentially private mechanism. In this paper, we define noninformative bounding procedures and formalize the differentially private truncated and boundary inflated truncated (BIT) mechanisms for releasing statistics with bounding constraints. The impacts of the noninformative truncated and BIT Laplace mechanisms on the statistical accuracy and utility of sanitized statistics, including bias, asymptotic consistency, and mean squared error, are evaluated both theoretically and empirically via simulation studies. We also provide an upper bound for the mean squared error between the sanitized and original results for a finite sample size $n$ in the truncated and BIT Laplace mechanisms; the bound goes to 0 if the scale parameter goes to 0 as $n \to \infty$.
keywords: global sensitivity, truncated mechanism, boundary inflated truncated (BIT) mechanism, bias, consistency, mean squared error
Protection of individual privacy is always a concern when releasing and sharing data and information. A data release mechanism aims to provide useful information to the public without compromising individual privacy. Differential privacy (DP) is a concept developed by theoretical computer scientists (Dwork et al., 2006b; Dwork, 2008, 2011) that has gained great popularity in recent years in both theoretical research and practical applications. DP formalizes privacy in mathematical terms without making assumptions about the background knowledge of data intruders and thus provides a robust concept for privacy protection. Practical applications of DP involve development of differentially private mechanisms, also referred to as sanitizers, through which original results are processed and converted to results that do not reveal individual information at a pre-specified privacy budget. There are general differentially private mechanisms such as the Laplace mechanism (Dwork et al., 2006b), the Exponential mechanism (McSherry and Talwar, 2007; McSherry, 2009), and more recently, the staircase mechanism (Geng et al., 2015), the generalized Gaussian mechanism (Liu, 2016), and adaptive mechanisms such as the multiplicative weighting mechanism (Hardt et al., 2012) and the median mechanism (Roth and Roughgarden, 2010) for sanitizing multiple correlated queries. There are also differentially private mechanisms that specifically target certain statistical analyses, such as robust and efficient point estimators (Dwork and Smith, 2010; Dwork, 2011), principal component analysis (Chaudhuri et al., 2012), linear and penalized regression (Chaudhuri et al., 2011; Kifer et al., 2012), Bayesian inferences of probabilistic graphical models (Zhang et al., 2015), and machine learning, data mining, and big data analytics in genomics, healthcare, and biometrics (Blum et al., 2008; Mohammed et al., 2011; Yu et al., 2014; Lin et al., 2016; Sadhya and Singh, 2016), among others.
In the context of DP, it is often assumed that data and statistics (query results) are bounded globally. It is technically difficult to apply DP, at least in some differentially private mechanisms, to perturb unbounded statistics while ensuring some usefulness of the sanitized results. The employment of “global”, data-invariant bounds rather than data-specific “local” bounds is one of the reasons underlying the robustness of DP against the worst-case privacy attack, without assumptions on what information the attackers hold or how capable they are. If local bounds, which are functions of the data at hand, were used without being sanitized, the bounds themselves would leak information about the original data. Some statistics, such as proportions and correlation coefficients, are naturally bounded regardless of data. Real-life data in general support the assumption of bounded data, providing a practical basis for the applications of differentially private sanitization algorithms that rely on data boundedness. For example, counts and frequencies formed from categorical attributes are bounded within $[0, n]$ for a given sample size $n$; it is safe to say human height is bounded within cm; and the number of car accidents per day in a city is bounded within , where , the maximum possible number of car accidents, is a known finite number though it might not be a tight upper bound. In parametric modelling, though numerical attributes are often modelled via distributions with unbounded domains, the distributional assumptions are in many cases only approximate, and the probabilities of out-of-bounds values are often small enough to be ignorable under these assumptions. For example, though human height is often modelled by Gaussian distributions with a support of $(-\infty, \infty)$, (height cm or cm) under the Gaussian assumption; in the car accident example, (number of per-day accidents ) under the assumed Poisson distribution.
If all attributes in a data set are bounded, statistics from either descriptive or inferential procedures based on that data set in general are also bounded. For example, if a numerical attribute is bounded within , so is its sample mean, and its variance is bounded within for a given sample size .
The bounding condition of a statistic can be incorporated when designing a differentially private mechanism. Barak et al. (2007) employed linear programming (and the Fourier transformation) to obtain non-negative and consistent sanitized contingency tables. Li et al. (2015) investigated an extension to the matrix mechanism they proposed that incorporates nonnegativity constraints when realizing count queries. In the Exponential mechanism, null utility can be assigned to out-of-bounds values so that the probability of releasing illegitimate out-of-bounds values is 0. In other mechanisms such as the Laplace and Gaussian mechanisms, which release sanitized results from the real line $(-\infty, \infty)$, there are two common practices. Some choose to ignore the bounding conditions and release the raw sanitized results whether they are out-of-bounds or not, which is not recommended given that out-of-bounds values carry no practical meaning. Others “legitimize” out-of-bounds sanitized results post hoc before release, such as by setting them at the boundaries or by throwing them away and re-sanitizing until a legitimate value is obtained.
There are few works examining how bounding procedures, while satisfying DP, affect the utility and validity of sanitized results relative to their originals. In this paper, we take a close and systematic look at the incorporation of bounding constraints in differentially private mechanisms and examine the effects of two bounding procedures on the utility of released data. Specifically, we define noninformative bounding, demonstrate the applications of the truncation and boundary inflated truncation (BIT) bounding procedures in two simulation studies, and assess their impact on the statistical accuracy of sanitized results both theoretically and empirically in the context of the Laplace mechanism with noninformative bounding.
Releasing statistics with bounding conditions falls under the umbrella of constrained inference in DP, more precisely, inference under inequality constraints. (Not to be confused with classical statistical inference, the “inference” in the constrained inference literature concerns finding a set of constrained results that are “optimal” estimators of the sanitized results by some criterion, such as the distance, subject to a set of pre-defined and known constraints.) In addition to the inequality constraints, statistics may also be subject to equality constraints simultaneously. A typical example is the release of proportions, which is subject to both the equality constraint $\sum_{k} p_k = 1$ and the inequality constraints $0 \le p_k \le 1$ for $k = 1, \ldots, K$. As another example, the release of 3 bounded statistics , , where , is also subject to the equality and inequality constraints. The differentially private data release with equality and inequality constraints can be formulated as a constrained optimization problem. Hay et al. (2010) were able to boost the accuracy of sanitized histograms (measured by the mean squared error between the sanitized and original results) by incorporating the rank constraint in unattributed histograms and the equality constraint in universal histograms. Qardaji et al. (2013) showed that combining the choice of a good branching factor with constrained inference can further boost the accuracy of a sanitized histogram. Li et al. (2015) proposed the matrix mechanism and one extension to incorporate nonnegativity constraints. In one of the two simulation studies in this discussion, we combined the truncated/BIT procedures, to satisfy the bounding constraint, with a slightly modified procedure from Hay et al. (2010), to satisfy the equality constraint, in releasing a vector of proportions that sum to 1. We compared this hybrid procedure with an intuitive rescaling procedure in terms of data utility and found some interesting results.
The simulation study was a small empirical attempt for releasing differentially private statistics with both equality and inequality constraints; we plan to continue to work on developing general and efficient mechanisms that release optimal sanitized results while satisfying both types of constraints.
The rest of the paper is organized as follows. Section 2 overviews the concepts of DP and some general differentially private mechanisms. Section 3 presents the truncation and BIT bounding procedures with noninformative bounding. Section 4 investigates the impact of noninformative bounding procedures on the utility of sanitized results in terms of bias, mean squared error, and asymptotic consistency. Section 5 illustrates the applications of the noninformative truncated and BIT Laplace mechanisms and examines the statistical properties of the sanitized results in two simulation studies. The paper concludes in Section 6 with some final remarks and plans for future work.
DP is defined as follows (Dwork, 2006; Dwork et al., 2006b): a sanitization/perturbation algorithm $\mathcal{R}$ is $\epsilon$-differentially private if, for all pairs of data sets $(\mathbf{x}, \mathbf{x}')$ with $d(\mathbf{x}, \mathbf{x}') = 1$ and all possible result subsets $Q$ to a query, $\left|\log\left(\Pr(\mathcal{R}(\mathbf{x}) \in Q) / \Pr(\mathcal{R}(\mathbf{x}') \in Q)\right)\right| \le \epsilon$, where $\epsilon > 0$ is the privacy budget parameter and $d(\mathbf{x}, \mathbf{x}') = 1$ denotes that data $\mathbf{x}$ differs from $\mathbf{x}'$ by only one individual (there are two commonly used definitions of “differing by one”; see the online supplementary materials). What is released after the sanitization via $\mathcal{R}$ is some perturbed results to statistical queries sent to the data. Under DP, the probabilities of obtaining the same query results from $\mathbf{x}$ and $\mathbf{x}'$ after sanitization via $\mathcal{R}$ are about the same: the ratio between them is bounded within $[e^{-\epsilon}, e^{\epsilon}]$, a neighborhood around 1. DP guarantees individual privacy protection at a given $\epsilon$ since the chance that a participant in the data set will be identified based on query results sanitized via $\mathcal{R}$ is very low, given that the query results are about the same with or without that individual in the data set. DP provides a robust and powerful model against privacy attacks in the sense that it does not make assumptions on the background knowledge or the behavior of data intruders. $\epsilon$ can be used as a tuning parameter: the smaller $\epsilon$ is, the more protection there is on the data released via $\mathcal{R}$. In addition to $\epsilon$-DP, there are softer versions of DP, including the $(\epsilon, \delta)$-approximate DP (aDP) (Dwork et al., 2006a), the $(\epsilon, \delta)$-probabilistic DP (pDP) (Machanavajjhala et al., 2008), the random DP (rDP) (Hall et al., 2012), and the concentrated DP (cDP) (Dwork and Rothblum, 2016). In all the relaxed versions, one additional parameter is employed to characterize the amount of relaxation on top of the privacy budget $\epsilon$. In $(\epsilon, \delta)$-aDP, $\Pr(\mathcal{R}(\mathbf{x}) \in Q) \le e^{\epsilon} \Pr(\mathcal{R}(\mathbf{x}') \in Q) + \delta$. A sanitization algorithm satisfies $(\epsilon, \delta)$-pDP if the probability of generating an output belonging to the disclosure set is bounded below $\delta$, where the disclosure set contains all the possible outputs that leak information for a given privacy tolerance $\epsilon$.
The -rDP is also a probabilistic relaxation of DP; but it differs from -pDP in that the probabilistic relaxation is with respect to the generation of the data while it is with respect to the sanitizer in the -pDP. The -cDP, similar to the -pDP, relaxes the satisfaction of DP with respect to the sanitizer, and ensures that the expected privacy cost is and (Prob(the actual cost )) is bounded by .
The Laplace mechanism is a popular sanitizer to release statistics with $\epsilon$-DP (Dwork et al., 2006b). Liu (2016) introduces the generalized Gaussian mechanism (GGM) of $(\epsilon, \delta)$-pDP that includes the Laplace mechanism as a special case (when $p = 1$ and $\delta = 0$). Denote the statistics of interest by $\mathbf{s}$. The Laplace mechanism and the GGM are based on the global sensitivity (GS), which is defined as $\Delta_p = \max_{\mathbf{x}, \mathbf{x}'} \|\mathbf{s}(\mathbf{x}) - \mathbf{s}(\mathbf{x}')\|_p$ for all pairs of data sets $(\mathbf{x}, \mathbf{x}')$ with $d(\mathbf{x}, \mathbf{x}') = 1$. $\Delta_p$ is the maximum possible difference in $\mathbf{s}$, defined in terms of the $l_p$ distance, between two data sets with $d(\mathbf{x}, \mathbf{x}') = 1$. The sensitivity is “global” since it is defined for all possible data sets and all possible ways of these two data sets differing by one record. The larger the GS is for $\mathbf{s}$, the larger the disclosure risk is from releasing the original $\mathbf{s}$, and the more perturbation is needed for $\mathbf{s}$ to offset the risk. Specifically, the Laplace mechanism of $\epsilon$-DP sanitizes $\mathbf{s}$ as in $\mathbf{s}^* = \mathbf{s} + \mathbf{e}$, where $\mathbf{e}$ comprises $r$ independent draws from the Laplace distribution Lap$(0, \Delta_1/\epsilon)$, $r$ is the number of elements in $\mathbf{s}$, and $\Delta_1$ is the $l_1$-GS of $\mathbf{s}$. For integer $p \ge 2$, the generalized Gaussian mechanism (GGM) of order $p$ sanitizes $\mathbf{s}$ with $(\epsilon, \delta)$-pDP by drawing sanitized $\mathbf{s}^*$ from the GG distribution where satisfies , , and is the -GS of and is the -GS of . When $p = 2$, the GGM becomes the Gaussian mechanism of $(\epsilon, \delta)$-pDP that generates sanitized $\mathbf{s}^*$ from N for . When $p = 2$, there exists an analytical lower bound on , that satisfies $(\epsilon, \delta)$-pDP; when $p > 2$, numerical approaches can be applied to obtain a lower bound on (Liu, 2016).
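As an illustration of the Laplace mechanism described above, the following Python sketch adds Laplace noise with scale GS/$\epsilon$ to a vector of statistics and numerically checks that the log-ratio of the output densities under two neighboring data sets does not exceed $\epsilon$. The function names, the example proportion of 0.30, and the assumed sensitivity of $1/n$ are hypothetical choices made for illustration only.

```python
import numpy as np

def laplace_mechanism(s, sensitivity, epsilon, rng):
    """Add independent Laplace(0, sensitivity / epsilon) noise to each element of s."""
    s = np.asarray(s, dtype=float)
    return s + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=s.shape)

def max_abs_log_ratio(s_x, s_xprime, sensitivity, epsilon, grid):
    """Largest |log f(y | x) - log f(y | x')| over a grid of outputs y."""
    scale = sensitivity / epsilon
    # The Laplace normalising constants cancel in the density ratio.
    log_ratio = (np.abs(grid - s_xprime) - np.abs(grid - s_x)) / scale
    return float(np.max(np.abs(log_ratio)))

# Hypothetical example: a proportion from n = 100 records, assuming that
# changing one record moves the proportion by at most 1/n.
n, epsilon = 100, 0.5
sensitivity = 1.0 / n
rng = np.random.default_rng(2024)
sanitized = laplace_mechanism([0.30], sensitivity, epsilon, rng)
worst = max_abs_log_ratio(0.30, 0.30 + sensitivity, sensitivity, epsilon,
                          np.linspace(-1.0, 2.0, 3001))
```

In this sketch `worst` equals $\epsilon$ up to floating-point error, which is the DP bound being attained for outputs far from both statistics.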
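The Exponential mechanism is another popular sanitizer of $\epsilon$-DP (McSherry and Talwar, 2007). The mechanism is based on a utility function of all possible outputs to a query and the sensitivity of the utility function. Denote by $u(s^* \mid \mathbf{x})$ the utility score of output $s^*$ given data $\mathbf{x}$. $\mathcal{S}$ is the set containing all possible outputs $s^*$, and $\Delta_u = \max_{\mathbf{x}, \mathbf{x}', s^* \in \mathcal{S}} |u(s^* \mid \mathbf{x}) - u(s^* \mid \mathbf{x}')|$ is the maximum change in score between two data sets $\mathbf{x}$ and $\mathbf{x}'$ with $d(\mathbf{x}, \mathbf{x}') = 1$. The Exponential mechanism of $\epsilon$-DP generates $s^*$ from the distribution
$$f(s^*) = \frac{\exp\left(\epsilon\, u(s^* \mid \mathbf{x}) / (2\Delta_u)\right)}{\int_{\mathcal{S}} \exp\left(\epsilon\, u(t \mid \mathbf{x}) / (2\Delta_u)\right) dt}. \qquad (1)$$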
The Exponential mechanism can sanitize bounded statistics directly by sampling from the distribution in Eq (1) with a predefined bounded domain $\mathcal{S}$. The Laplace mechanism and the GGM produce unbounded sanitized results from the real line $(-\infty, \infty)$; as such, some bounding procedures need to be in place if they are to be applied to sanitize bounded statistics.
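To make the sampling from a bounded domain concrete, here is a minimal Python sketch of the Exponential mechanism on a discretized domain. The grid over $[0, 1]$, the utility $u(s^* \mid \mathbf{x}) = -|s^* - s|$, and its assumed sensitivity of $1/n$ are illustrative assumptions, not choices taken from the text.

```python
import numpy as np

def exponential_mechanism(candidates, utilities, utility_sensitivity, epsilon, rng):
    """Draw one candidate with probability proportional to
    exp(epsilon * utility / (2 * utility_sensitivity))."""
    utilities = np.asarray(utilities, dtype=float)
    logits = epsilon * utilities / (2.0 * utility_sensitivity)
    logits = logits - logits.max()      # subtract the max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    index = rng.choice(len(candidates), p=probs)
    return candidates[index], probs

# Hypothetical setup: release a proportion s = 0.30 from n = 100 records on a
# grid over the bounded domain [0, 1]; out-of-bounds values are never candidates.
n, epsilon, s = 100, 1.0, 0.30
candidates = np.linspace(0.0, 1.0, 101)
utilities = -np.abs(candidates - s)     # assumed utility; sensitivity taken as 1/n
rng = np.random.default_rng(7)
released, probs = exponential_mechanism(candidates, utilities, 1.0 / n, epsilon, rng)
```

Because only in-bounds candidates are enumerated, the released value is legitimate by construction, which is the property the paragraph above points to.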
3 Bounding of Statistics in Differential Privacy
In this section, we define a bounding procedure in the context of DP, depending on whether it leaks original information or not. We also formalize two commonly used bounding procedures in the context of the Laplace mechanism to set up the framework for the examination of the statistical properties of sanitized outcomes from these procedures in Section 4.
(noninformative bounding) A bounding procedure is noninformative and data invariant if an application of does not reveal more information about the original data in addition to what is released prior to the application of . In the context of differential privacy, suppose , then
If a noninformative bounding procedure is applied to bound sanitized results, then we can spend all of the pre-specified privacy budget on sanitization, without having to be concerned about the possibility that the actual total privacy cost would exceed the budget. On the contrary, if and a bounding procedure leads to , then we will refer to the bounding procedure as informative. A bounding procedure being informative does not mean it cannot be applied to realize DP. Given that the bounding procedure itself costs privacy, a prespecified privacy budget can be split between and . However, this would require the quantification of the privacy cost of employing , which can be difficult. In this discussion, we focus on noninformative bounding (informative bounding will be a topic of future work). We present below two noninformative bounding procedures in the context of the Laplace mechanism. Both are intuitive and effective, and have been employed in practical applications of differentially private sanitizers to release data. The extensions of both procedures to other differentially private mechanisms (such as the GGM) are straightforward.
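Denote the bounded statistics by $\mathbf{s} = (s_1, \ldots, s_r)$, where $[c_{i0}, c_{i1}]$ are the bounds for element $i$ in $\mathbf{s}$ ($i = 1, \ldots, r$), the privacy budget by $\epsilon$, the $l_1$-GS of $\mathbf{s}$ by $\Delta_1$, and let $\lambda = \Delta_1/\epsilon$.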
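(truncated Laplace mechanism) The truncated Laplace mechanism of $\epsilon$-DP sanitizes $\mathbf{s}$ by drawing $s_i^*$, for $i = 1, \ldots, r$, from the truncated Laplace distribution
$$f(s_i^* \mid s_i, \lambda, c_{i0} \le s_i^* \le c_{i1}) = \frac{\exp(-|s_i^* - s_i|/\lambda)}{2\lambda\,[F(c_{i1}) - F(c_{i0})]}, \qquad (3)$$
where $F$ is the cumulative distribution function of the Laplace distribution Lap$(s_i, \lambda)$ with location $s_i$ and scale $\lambda$.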
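(boundary-inflated-truncated Laplace mechanism) The boundary-inflated-truncated (BIT) Laplace mechanism of $\epsilon$-DP sanitizes $\mathbf{s}$ by drawing $s_i^*$, for $i = 1, \ldots, r$, from the BIT Laplace distribution $f(s_i^* \mid s_i, \lambda)$, where $f$ is
$$f(s_i^* \mid s_i, \lambda) = \begin{cases} F(c_{i0}) & \text{if } s_i^* = c_{i0}, \\ (2\lambda)^{-1}\exp(-|s_i^* - s_i|/\lambda) & \text{if } c_{i0} < s_i^* < c_{i1}, \\ 1 - F(c_{i1}) & \text{if } s_i^* = c_{i1}, \end{cases} \qquad (4)$$
and $F$ is the cumulative distribution function of the Laplace distribution Lap$(s_i, \lambda)$ with location $s_i$ and scale $\lambda$.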
Both the truncation and BIT procedures can be either informative or noninformative, depending on whether the bounds at which the truncation or BIT occurs are data invariant or not. The truncated Laplace mechanism can also be realized in a post-hoc manner by throwing away out-of-bounds differentially private sanitized results from the regular Laplace mechanism, which can be computationally expensive compared to direct sampling. Similarly, the BIT bounding procedure can be realized by setting, post hoc, out-of-bounds sanitized results from the regular Laplace mechanism at the corresponding boundaries, which is the preferred way to sample from the BIT Laplace distribution. If the scale parameter of the Laplace distribution $\lambda \to \infty$, as either $\epsilon \to 0$ or $\Delta_1 \to \infty$, it can be easily proved that the distribution in the truncated Laplace mechanism in Eq. (3) converges to a uniform distribution unif$(c_{i0}, c_{i1})$, and that the BIT Laplace distribution in Eq. (4) converges to a Bernoulli distribution with probability mass at $c_{i0}$ and $c_{i1}$. In both cases, the sanitized results preserve little original information.
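The two procedures above can be sketched in Python as follows. The truncated Laplace draw uses direct sampling through the inverse CDF restricted to the bounds, and the BIT Laplace draw uses the post-hoc clipping described in the text. The function names and the example values ($s = 0.1$, bounds $[0, 1]$, scale 0.2) are hypothetical; this is a sketch under those assumptions rather than a reference implementation.

```python
import numpy as np

def laplace_cdf(y, loc, scale):
    """CDF of Laplace(loc, scale), written to avoid overflow in exp."""
    z = (np.asarray(y, dtype=float) - loc) / scale
    return np.where(z < 0, 0.5 * np.exp(np.minimum(z, 0.0)),
                    1.0 - 0.5 * np.exp(-np.maximum(z, 0.0)))

def laplace_ppf(p, loc, scale):
    """Inverse CDF of Laplace(loc, scale)."""
    p = np.asarray(p, dtype=float)
    lower = loc + scale * np.log(2.0 * np.minimum(p, 0.5))
    upper = loc - scale * np.log(2.0 * np.minimum(1.0 - p, 0.5))
    return np.where(p < 0.5, lower, upper)

def truncated_laplace(s, c0, c1, scale, size, rng):
    """Direct sampling: uniform draw on [F(c0), F(c1)], then invert the CDF."""
    u = rng.uniform(laplace_cdf(c0, s, scale), laplace_cdf(c1, s, scale), size=size)
    return np.clip(laplace_ppf(u, s, scale), c0, c1)  # clip guards rounding only

def bit_laplace(s, c0, c1, scale, size, rng):
    """Post-hoc BIT: regular Laplace draws, out-of-bounds values set at the bounds."""
    return np.clip(rng.laplace(s, scale, size=size), c0, c1)

rng = np.random.default_rng(11)
s, c0, c1, scale = 0.1, 0.0, 1.0, 0.2    # hypothetical statistic, bounds, and scale
trunc = truncated_laplace(s, c0, c1, scale, 10_000, rng)
bit = bit_laplace(s, c0, c1, scale, 10_000, rng)
```

With a very large scale, the same functions give nearly uniform truncated draws and BIT draws that sit almost entirely on the two boundaries, in line with the limiting behavior stated above.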
The truncated and BIT Laplace mechanisms, as in the regular Laplace mechanism, require calculation of the $l_1$-GS of the statistics $\mathbf{s}$ targeted for sanitization. GS in general needs to be determined analytically, though the value might not be tight; numerical computation of GS is not feasible since it is impossible to enumerate all possible data and all possible ways of two data sets differing by one record, especially when the data contain continuous attributes or when the sample size is large. We have obtained the GS of some common statistics, including proportions, means, variances, and covariances (see the online supplementary materials). The GS values were calculated for both definitions of two data sets differing by one record, and the results turned out to be the same for both definitions on most of the examined statistics (except for histograms, pooled variances and covariances). In all calculations, we assume the sample size $n$ is a known constant and carries no privacy concern, which is often the case in statistical analysis except for, for example, adaptive and group sequential designs, where the final $n$ is a function of data. It should be noted that the GS of a function of a statistic is in general not equal to the function of the GS of the statistic. For example, the GS of the sample standard deviation (SD) cannot be obtained simply by taking the square root of the GS of the sample variance. In fact, the GS of the SD is more difficult to calculate analytically than that of the variance. When the GS of $\mathbf{s}$ is not easy to calculate, but the GS of a data-independent function of $\mathbf{s}$, say $t(\mathbf{s})$, is, we can instead sanitize $t(\mathbf{s})$ to obtain $t^*$ and then obtain the sanitized $\mathbf{s}^*$ via the back-transformation $\mathbf{s}^* = t^{-1}(t^*)$.
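The back-transformation idea can be sketched for the SD example: sanitize the sample variance, bound it, and take the square root. In the Python sketch below, the function name, the simulated data, and in particular the variance sensitivity `var_gs = (c1 - c0)**2 / n` are assumptions made for illustration; the actual GS should be taken from an analytical derivation.

```python
import numpy as np

def sanitize_sd_via_variance(x, var_bounds, var_sensitivity, epsilon, rng):
    """Sanitize the sample variance with a clipped (BIT) Laplace step,
    then back-transform to a standard deviation with a square root."""
    x = np.asarray(x, dtype=float)
    variance = x.var(ddof=1)
    noisy = variance + rng.laplace(0.0, var_sensitivity / epsilon)
    sanitized_var = float(np.clip(noisy, var_bounds[0], var_bounds[1]))
    return float(np.sqrt(sanitized_var))    # back-transformation t^{-1}

rng = np.random.default_rng(3)
c0, c1, n, epsilon = 0.0, 10.0, 200, 1.0
x = rng.uniform(c0, c1, size=n)             # stand-in for bounded original data
# Upper bound on the sample variance of data confined to [c0, c1].
var_upper = (c1 - c0) ** 2 / 4.0 * n / (n - 1)
var_gs = (c1 - c0) ** 2 / n                 # ASSUMED sensitivity, illustration only
sd_star = sanitize_sd_via_variance(x, (0.0, var_upper), var_gs, epsilon, rng)
```

The square root is applied only to the already sanitized variance, so it touches no original data.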
4 Statistical Properties of Sanitized
In Definition 2, the bounds $[c_{i0}, c_{i1}]$ need to be data invariant and global in order for a truncated or BIT bounding procedure to be noninformative. On the other hand, by ignoring the local properties of data $\mathbf{x}$, a noninformative bounding procedure could have an impact on the statistical properties of sanitized results $\mathbf{s}^*$. In this section, we investigate the statistical behaviors of $\mathbf{s}^*$ produced by a sanitizer with bounding constraints. We start with defining what statistical properties of sanitized $\mathbf{s}^*$ would be desirable (Definitions 3 and 4).
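(unbiasedness and consistency of sanitized statistics for original statistics) Sanitized $\mathbf{s}^*$ is unbiased for the original $\mathbf{s}$ if $E(\mathbf{s}^* \mid \mathbf{s}) = \mathbf{s}$. $\mathbf{s}^*$ is asymptotically unbiased for $\mathbf{s}$ if $E(\mathbf{s}^* \mid \mathbf{s}) \to \mathbf{s}$ as $n \to \infty$, where $n$ is the sample size of original data $\mathbf{x}$. $\mathbf{s}^*$ is consistent for $\mathbf{s}$ if $\mathbf{s}^* \xrightarrow{p} \mathbf{s}$ as $n \to \infty$.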
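(asymptotic bias and consistency of sanitized statistics for true parameters) Suppose that $\mathbf{s}$, the target statistics for sanitization, are estimators for parameters $\boldsymbol{\theta}$ from a statistical model.
a) If $E(\mathbf{s}^* \mid \mathbf{s}) \to \mathbf{s}$ as $n \to \infty$, and either $E(\mathbf{s}) = \boldsymbol{\theta}$ or $E(\mathbf{s}) \to \boldsymbol{\theta}$, then $\mathbf{s}^*$ is asymptotically unbiased for $\boldsymbol{\theta}$; that is, $E(\mathbf{s}^*) \to \boldsymbol{\theta}$.
b) If $\mathbf{s}^* \xrightarrow{p} \mathbf{s}$ and $\mathbf{s} \xrightarrow{p} \boldsymbol{\theta}$ as $n \to \infty$, then $\mathbf{s}^*$ is consistent for $\boldsymbol{\theta}$; that is, $\mathbf{s}^* \xrightarrow{p} \boldsymbol{\theta}$.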
The proof of Definition 4 is given in Appendix B. Definition 4 implies that a desired statistical property of sanitized results relative to the true parameters can be achieved in two steps. For example, if the desired statistical property is consistency, then the first step is to choose an estimator $\mathbf{s}$ that is consistent for $\boldsymbol{\theta}$, which should be relatively easy to complete given that asymptotically unbiased and consistent estimators are well studied in statistics; the second step is to use a sanitizer that generates $\mathbf{s}^*$ that is consistent for $\mathbf{s}$.
When is boundless and sanitized by the regular Laplace mechanism, then is unbiased for since per the definition of the Laplace distribution. If , where , then is also consistent for . When is bounded and sanitized via the noninformative truncated or BIT Laplace mechanism, we will have biased unless the noninformative bounds are symmetric around the original . Proposition 5 presents the magnitude of the bias of relative to in the noninformative truncated and BIT Laplace mechanisms, respectively, and a sufficient condition for to achieve consistency for . The proofs of Proposition 5 are provided in Appendix A.
(bias of sanitized statistic from truncated Laplace and BIT Laplace mechanism) Let be the global bounds on a singular , be the scale parameter, be the location parameter of the Laplace distribution, be the expected mean of the truncated Laplace distribution , and be the expected mean of the BIT Laplace distribution
a) ( is unbiased for ) if and only if ( and are symmetric around ), where
b) ( and are of the same sign) and ( sanitized via the BIT Laplace sanitizer is no more biased than that via the truncated Laplace sanitizer).
c) sanitized via the truncated Laplace sanitizer or the BIT Laplace sanitizer is asymptotically unbiased and consistent for if the scale parameter approaches 0 asymptotically.
If $[c_0, c_1]$ are global bounds, it is unlikely to have unbiased sanitized results in real-life applications via the truncated or the BIT Laplace mechanism per part a) of Proposition 5, as $[c_0, c_1]$ are fixed while $s$ changes from data to data. To achieve unbiasedness for $s$, local bounds that depend on specific data sets can be constructed, but at additional privacy cost. For example, bounds that are symmetric around $s$ can be used to bound sanitized results in the truncated and BIT Laplace mechanisms. However, since such bounds are functions of the original $s$, they will leak information about $s$, which has to be accounted for in the total privacy cost.
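The bias statements in Proposition 5 can be checked numerically. The Python sketch below computes the expected values of the truncated and BIT Laplace distributions by trapezoidal integration on a grid and compares their biases for asymmetric and symmetric bounds. The function names and the example values are hypothetical, and the code assumes the original statistic lies within the bounds.

```python
import numpy as np

def _trapezoid(f, y):
    """Trapezoidal rule on an equally spaced grid y."""
    dy = y[1] - y[0]
    return dy * (f.sum() - 0.5 * (f[0] + f[-1]))

def truncated_laplace_mean(s, c0, c1, scale, num=200_001):
    """Mean of Laplace(s, scale) truncated to [c0, c1]."""
    y = np.linspace(c0, c1, num)
    w = np.exp(-np.abs(y - s) / scale)           # unnormalised density
    return float(_trapezoid(y * w, y) / _trapezoid(w, y))

def bit_laplace_mean(s, c0, c1, scale, num=200_001):
    """Mean of Laplace(s, scale) with out-of-bounds mass moved to c0 and c1.
    Assumes c0 <= s <= c1."""
    y = np.linspace(c0, c1, num)
    dens = np.exp(-np.abs(y - s) / scale) / (2.0 * scale)
    mass_low = 0.5 * np.exp(-(s - c0) / scale)   # P(draw <= c0)
    mass_high = 0.5 * np.exp(-(c1 - s) / scale)  # P(draw >= c1)
    return float(c0 * mass_low + c1 * mass_high + _trapezoid(y * dens, y))

c0, c1, scale = 0.0, 1.0, 0.2                    # hypothetical bounds and scale
bias_trunc = truncated_laplace_mean(0.1, c0, c1, scale) - 0.1
bias_bit = bit_laplace_mean(0.1, c0, c1, scale) - 0.1
bias_trunc_sym = truncated_laplace_mean(0.5, c0, c1, scale) - 0.5
bias_bit_sym = bit_laplace_mean(0.5, c0, c1, scale) - 0.5
```

For these example values both biases are positive, the BIT bias is the smaller of the two, and both vanish when the bounds are symmetric around the statistic.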
Though sanitized results via sanitizers with noninformative bounding might be biased for the original results, they can still enjoy desirable asymptotic properties such as asymptotic unbiasedness and consistency as $n \to \infty$ under mild regularity conditions per part c) of Proposition 5. In the framework of the truncated and BIT Laplace mechanisms, the scale parameter of the associated Laplace distribution is $\lambda = \Delta_1/\epsilon$. With $\epsilon$ pre-specified, the only link between the sample size $n$ and $\lambda$ is $\Delta_1$. To satisfy the condition $\lambda \to 0$, $\Delta_1$ needs to approach 0 as $n \to \infty$. Intuitively speaking, as $n$ increases, the influence of a single individual on an aggregate measure of a data set is likely to diminish, and the individual is less prone to be identified from releasing the aggregate measure. Translated to the GS of the aggregate measure, it means $\Delta_1$ decreases with $n$. The GS values of some commonly used statistics, such as proportions, means, variances and covariances, are $O(n^{-1})$ (online supplemental materials); so, per part c) of Proposition 5, the sanitized copies of these statistics via either the truncated or the BIT Laplace mechanism are consistent for their original values.
In practice, it is also important to bound the error of a sanitized statistic relative to its original value. Proposition 6 provides an upper bound for the mean squared error (MSE) of a sanitized statistic from the truncated and the BIT Laplace mechanisms and examines its rate of convergence to 0 as the scale parameter goes to 0.
(an upper bound on mean squared error and convergence rate) Let be the location parameter, be the scale parameter of the Laplace distribution, and be the sanitized result for via the truncated Laplace or the BIT Laplace mechanisms. is upper bounded by . If , where and is the sample size, then the rate of the MSE converging to 0 is for a given .
The proof is provided in Appendix C. An example is the sanitization of a mean or a proportion, where the GS $\Delta_1 \propto n^{-1}$, so $\lambda = \Delta_1/\epsilon \propto n^{-1}$ for a fixed $\epsilon$. The results in Proposition 6 imply that the MSE of a sanitized statistic from the truncated and the BIT Laplace mechanisms is comparable to that from the regular Laplace mechanism without bounding, for which the MSE of a sanitized statistic is $2\lambda^2$. In other words, the bounding does not change the order of the MSE of a sanitized statistic despite the loss of unbiasedness in general.
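A small Monte Carlo check of this comparison is sketched below in Python. It estimates the MSE of truncated and BIT Laplace draws around a proportion for increasing $n$ and sets them next to $2\lambda^2$, the MSE of the regular Laplace mechanism. The proportion of 0.1, the GS of $1/n$, $\epsilon = 1$, and the rejection sampler are assumptions chosen for illustration.

```python
import numpy as np

def truncated_laplace(s, c0, c1, scale, size, rng):
    """Rejection sampler: redraw until `size` in-bounds values are collected."""
    kept = np.empty(0)
    while kept.size < size:
        draw = rng.laplace(s, scale, size=size)
        kept = np.concatenate([kept, draw[(draw >= c0) & (draw <= c1)]])
    return kept[:size]

def bit_laplace(s, c0, c1, scale, size, rng):
    """Regular Laplace draws with out-of-bounds values set at the bounds."""
    return np.clip(rng.laplace(s, scale, size=size), c0, c1)

rng = np.random.default_rng(5)
s, c0, c1, epsilon, reps = 0.1, 0.0, 1.0, 1.0, 20_000
results = []                                   # rows: (n, MSE truncated, MSE BIT, 2*lambda^2)
for n in (50, 200, 800):
    scale = (1.0 / n) / epsilon                # ASSUMED GS of 1/n for a proportion
    mse_trunc = float(np.mean((truncated_laplace(s, c0, c1, scale, reps, rng) - s) ** 2))
    mse_bit = float(np.mean((bit_laplace(s, c0, c1, scale, reps, rng) - s) ** 2))
    results.append((n, mse_trunc, mse_bit, 2.0 * scale ** 2))
```

Since clipping a draw to an interval that contains the original statistic can only move it closer to that statistic, the BIT estimate should not exceed $2\lambda^2$ beyond Monte Carlo error.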
5 Simulation Studies
We conducted two simulation studies to demonstrate the applications of the noninformative truncated and BIT bounding mechanisms and examine the statistical properties of the sanitized results. In the first simulation, we sanitized a variance-covariance matrix, and focused on the rate at which the sanitized results approached the original as the sample size increased and on the comparison between the truncated and BIT Laplace mechanisms in their influences on the sanitized results. In the second simulation, we sanitized proportions and focused on the inferential properties of the sanitized proportions by examining the bias, root mean squared error (RMSE), and coverage probability (CP) for the true proportions based on the sanitized results.
5.1 Simulation Study 1
In this simulation, we applied the noninformative truncated and BIT Laplace mechanisms to sanitize a variance-covariance matrix in a data set of size $n$. The variance-covariance matrix is an ideal statistic to examine the bounding effects of the two mechanisms given that every element in the matrix has to satisfy some type of bounding constraint; it is also the most common statistic for examining the dependency structure among multiple continuous variables. The constraints in the sanitization of a general covariance matrix of any dimension include that the marginal variances are positive and the correlations are bounded within $[-1, 1]$. Additionally, the marginal variances are also right-bounded when the data from which the matrix is calculated are bounded. Table 1 summarizes the bounds and the global sensitivity of the components of the matrix.
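[Table 1: bounds and global sensitivity of the components of the variance-covariance matrix. Table note: were the bounds of variables and .]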
When sanitizing the variance-covariance matrix, we first obtain legitimate sanitized marginal variances, and then sanitize the covariance given the sanitized variances, under the constraint that the implied correlation lies within $[-1, 1]$. Though the bounds for the covariance depend on the two variances, the latter are already sanitized; therefore, a bounding procedure for the covariance that uses the sanitized variances does not incur additional privacy cost. It is possible that the sanitized covariance matrix is not positive definite (PD) with the element-wise sanitization approach. If a sanitized covariance matrix is not PD and has a significant number of (small) negative eigenvalues, then it can be made PD with semidefinite optimization via, e.g., the alternating projections algorithm (Higham, 2002), the Newton methods for the nearest correlation matrix (Qi and Sun, 2006; Borsdorf and Higham, 2010), and the spectral projected gradient method (Qi and Sun, 2006; Borsdorf et al., 2010) (for example, the R function nearPD() in package Matrix implements the alternating projections algorithm). A possible alternative is to sanitize the matrix as a whole instead of element-wise, an interesting and worthwhile topic for future research.
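As a rough illustration of repairing a sanitized matrix that is not PD, the Python sketch below clips negative eigenvalues to a small positive floor. This single spectral projection is a simpler stand-in chosen for brevity, not the alternating projections algorithm behind nearPD(), and the function name, the floor `min_eig`, and the example matrix are hypothetical.

```python
import numpy as np

def clip_to_positive_definite(cov, min_eig=1e-6):
    """Symmetrize the matrix and raise every eigenvalue below min_eig to min_eig."""
    cov = np.asarray(cov, dtype=float)
    sym = (cov + cov.T) / 2.0
    eigval, eigvec = np.linalg.eigh(sym)
    clipped = np.maximum(eigval, min_eig)
    fixed = (eigvec * clipped) @ eigvec.T     # V diag(clipped) V^T
    return (fixed + fixed.T) / 2.0            # remove rounding asymmetry

# Hypothetical element-wise sanitized matrix with a negative eigenvalue.
sanitized_cov = np.array([[1.0, 0.9, -0.9],
                          [0.9, 1.0, 0.9],
                          [-0.9, 0.9, 1.0]])
repaired_cov = clip_to_positive_definite(sanitized_cov)
```

Unlike nearest-correlation-matrix methods, eigenvalue clipping generally changes the diagonal, so the marginal variances of the repaired matrix can differ from the sanitized ones. The repair acts only on sanitized values and not on the original data.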
To examine and compare the two noninformative procedures, we examined a variance-covariance matrix with three different specifications of : , and , respectively; and set the global bounds at and at . The 3 correlation settings allow us to examine the bounding effects on the pairwise correlation when there is no correlation, moderate (negative) correlation, and strong (positive) correlation (for centered and approximately normal variables, the bounds of with a SD of 1 and with a SD of both represent data mass, though this simulation does not require variables to be centered or normal). The total privacy budget was . Since 3 statistics were sanitized on the same set of data, the sequential composition principle applied (McSherry, 2009). There are many ways to allocate the total budget to the multiple statistics, such as according to the statistical or practical “importance” of the statistics (see Liu (2017) for more discussion). Here we divided the total privacy budget equally among the 3 statistics; that is, each sanitization received 1/3 of the total budget. We also investigated a wide range of sample sizes, from 50 to 800. At each specification of and a given $n$, 500 independent sanitizations were carried out so as to examine the distributional properties of the sanitized results.
The results are presented in Figure 1. In each plot, the original results and the mean and the 2.5%, 25%, 75% and 97.5% percentiles of the sanitized results are presented. The main findings are summarized as follows. First, when $n$ was relatively small, there was noticeable mean deviation of the sanitized results from the original results, except for and when (the boundaries were symmetric about the original results and thus there was no bias per part a of Proposition 5). Second, the sanitized results generated via the truncated Laplace mechanism were more biased than those via the BIT Laplace mechanism, consistent with part b of Proposition 5. Third, as $n$ increased, both the deviation (bias) and the dispersion of the sanitized results approached 0, consistent with part c of Proposition 5. Lastly, since the scale parameter of the associated Laplace distribution in both mechanisms was large when $n$ was small, more sanitized results were set at the boundary values in the BIT mechanism (especially for the marginal variance and correlation), and the distribution of the sanitized results became flatter in the truncated Laplace mechanism.
5.2 Simulation Study 2
In this simulation, we aimed to release the proportions of the four levels of a categorical variable, $\mathbf{p} = (p_1, p_2, p_3, p_4)$. Release of proportions is very common in public data release. In addition, besides the bounding constraints on each proportion element, $\mathbf{p}$ is also subject to the equality constraint $\sum_{k=1}^{4} p_k = 1$, which has to be retained in the released sanitized results, making it an interesting example to study. Since the cells are disjoint, the addition or removal of a single database element can affect the count in exactly one cell, and the GS ($\Delta_1$) of releasing a vector of disjoint proportions is $1/n$ or $2/n$, depending on which definition of “differing by one record” is used (see the online supplementary materials). With the Laplace mechanism, each proportion in the proportion vector is perturbed with a noise term from Lap$(0, \Delta_1/\epsilon)$. In this simulation, we used $\Delta_1 = 1/n$, and examined 3 different specifications of $\epsilon$ ( and 1) and a range of sample sizes from 50 to 500. The obtained results are also applicable to $\Delta_1 = 2/n$ with doubled $\epsilon$. 500 multinomial data sets, each of size $n$, were simulated from a multinomial distribution. (The parameter values were chosen because they span a wide range of proportions: some are closer to the boundaries while others are closer to the center. We expect the inferences on the closer-to-boundaries proportions to be more sensitive to the bounding procedures.)
The sample proportions were calculated in each simulated data set and were sanitized via the truncated and BIT Laplace mechanisms, respectively. We employed 3 procedures to ensure the equality constraint (in addition to the bounding constraints on each proportion). In the first approach (rescaling and normalization), each proportion in was sanitized independently. Since it was very unlikely that the sum of the 4 sanitized proportions, denoted by for , was equal to 1, we normalized as in and released the normalized . The rescaling approach was intuitive but might appear to be ad-hoc. In the second approach (referred to as the all-but-one approach), we sanitized 3 proportions out of 4, and then calculated the 4th proportion via . The all-but-one approach was also intuitive but less ad-hoc as it obeyed the equality constraint during the sanitization. In the third approach (the hybrid approach), we applied the optimal procedure from Hay et al. (2010) to ensure the equality constraint; the procedure is optimal in that the constrained results are associated with the smallest MSE (vs. the sanitized results) among all the linear unbiased estimators that satisfy the equality constraint. The procedure we applied has the following steps, which differ slightly from the original procedure in Hay et al. (2010) due to the fixed summation of 1. Specifically, we first arranged the 4 proportions in a 3-layer binary tree structure. The root node in the tree always had a value of 1, and its two children and satisfied the constraint (constraint 1). The two children of , and , satisfied (constraint 2); and the two children of , and , satisfied (constraint 3); and thus the four leaf nodes corresponded to the 4 proportions and satisfied . Second, and in layer 2 and , and in layer 3 were sanitized using the Laplace mechanism with a rate parameter of . The GS doubled in this procedure since there were two sets of proportions released rather than just one set as in the two approaches above.
Denote the sanitized 6 proportions by . Third, we calculated the inconsistency for each of the nodes in . if was a leaf node in layer 3, and for in layer 2, where were children of . Fourth, we calculated the constrained proportions for the root node , , where was the parent of for the nodes in layers 2 and 3. Finally, the truncated and BIT mechanisms were applied to ensure the elements in were (since calculated this way, though satisfying the 3 equality constraints listed above, could be or ).
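The first two approaches (rescaling and all-but-one) are simple enough to sketch in a few lines. The function names are hypothetical, and dividing by the sum is assumed to be the normalization used in the rescaling approach; the hybrid approach is not reproduced here because its formulas are not recoverable from the text.

```python
def rescale(sanitized):
    """Rescaling approach (sketch): divide each bounded sanitized
    proportion by the sum so that the released vector sums to 1."""
    total = sum(sanitized)
    if total <= 0:
        raise ValueError("sanitized proportions must have a positive sum")
    return [p / total for p in sanitized]


def all_but_one(sanitized_others):
    """All-but-one approach (sketch): the last proportion is obtained as
    1 minus the sum of the sanitized others. The remainder can fall outside
    [0, 1]; how such cases are handled is not specified here."""
    return list(sanitized_others) + [1.0 - sum(sanitized_others)]


# Hypothetical sanitized values, for illustration only.
released_rescaled = rescale([0.12, 0.18, 0.33, 0.41])
released_abo = all_but_one([0.12, 0.18, 0.33])
```

Both functions return a vector that satisfies the equality constraint by construction; they differ in which elements absorb the adjustment.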
In each of the 3 equality constraint approaches, a single set of sanitized proportions was released in each simulation. We calculated the bias and RMSE relative to the true , and the coverage probability (CP) of the 95% confidence interval for the true based on the sanitized . The bias, RMSE, and CP based on the sanitized results were compared to those based on the original . Due to space limitations, we only present the results from the rescaling approach (Figure 2), which were the best among all three approaches from a population inferential perspective. The results from the all-but-one and the hybrid approaches are briefly discussed below and can be found in the online supplementary materials.
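The three performance measures can be computed over the simulation repetitions along the following lines. This is a sketch under assumptions: the function name is hypothetical, and a Wald-type interval (estimate plus or minus 1.96 standard errors) is assumed, since the interval construction is not spelled out in this section.

```python
import math


def summarize(estimates, std_errors, truth, z=1.959964):
    """Return (bias, RMSE, coverage probability) over repetitions.

    estimates  : one point estimate per repetition
    std_errors : matching standard errors (Wald interval assumed)
    truth      : the true parameter value
    """
    reps = len(estimates)
    bias = sum(estimates) / reps - truth
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / reps)
    covered = sum(
        1 for e, se in zip(estimates, std_errors)
        if e - z * se <= truth <= e + z * se
    )
    return bias, rmse, covered / reps
```

Applying the same function to the original and to the sanitized proportions gives the side-by-side comparison described above.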
In the rescaling approach, there was minimal bias in the sanitized when and regardless of and the bounding mechanism (truncated or BIT); and there was some bias at small when , especially for the smallest proportion (positive bias) and the largest proportion (negative bias). Consistent with Proposition 5, the BIT mechanism yielded less biased sanitized results than the truncated mechanism. In addition, the RMSE was inflated in the sanitized results compared to the original RMSE, which was expected considering the noise introduced during the sanitization step. The larger the privacy budget or the larger was, the smaller the inflation was. Though the BIT mechanism led to smaller bias compared to the truncated mechanism for small when , the RMSE values were larger in the former than in the latter in this simulation. Finally, there was undercoverage for and , which spanned a wider range of and got more severe as decreased (the empirical coverage was around 95% at all when , worsened to for a wider range of when , and further deteriorated to across the whole range of when ). The BIT Laplace mechanism also had worse undercoverage than the truncated mechanism at small for . Compared to the rescaling approach, the all-but-one approach had similar performance in bias and RMSE for and 1, and similar CP performance in all proportions but , which was the proportion being calculated from the other 3. When , the BIT mechanism offered smaller bias when coupled with the all-but-one approach compared to the truncation approach; both bounding mechanisms had similar RMSE and CP in the all-but-one approach compared to the rescaling approach in all proportions but . Compared to the rescaling approach, the hybrid approach was worse in bias, RMSE and CP for all proportions at all values when was relatively small; the underperformance was the most obvious when and was less evident when and .
The undercoverage can be resolved to some degree by using the multiple synthesis (MS) technique in DP (Liu, 2017; Bowen and Liu, 2016). MS takes into account the variability introduced by the sanitization process by releasing multiple synthetic sets. Five sets were independently sanitized and released per original result in each simulation scenario, and the inferences were combined over the 5 synthesized sets using the rule given in Liu (2017). In order to maintain the overall ε-DP, each set was sanitized using 1/5 of the total budget per the sequential composition theorem. The results are given in Figure 3. The CP improved significantly from releasing multiple synthetic data sets, especially in the case of the BIT mechanism. However, due to the decreased privacy budget per synthetic set and the bounding, the sanitized results were much noisier and the biases were noticeably larger, even after averaging the sanitized results. For example, in the case of the truncated mechanism when total , the bias never diminished to 0 and the RMSE never reached the original RMSE levels within the examined range of sample size . The comparisons of the all-but-one and the hybrid approaches with the rescaling approach were similar to those in the single synthesis setting.
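To see why each synthetic set is noisier under MS, the budget split implied by sequential composition can be worked through in a few lines. This is an illustrative sketch only: the function name and the numbers (a GS of 0.01 and a total budget of 1) are hypothetical and not taken from the simulation.

```python
def per_release_scale(global_sensitivity, total_epsilon, num_releases):
    """Laplace scale for each release when the total privacy budget is
    split evenly across releases (sequential composition)."""
    epsilon_each = total_epsilon / num_releases
    return global_sensitivity / epsilon_each


# Hypothetical values: one release versus five releases of the same statistic.
single_scale = per_release_scale(0.01, 1.0, 1)
ms_scale = per_release_scale(0.01, 1.0, 5)
```

With five releases the Laplace scale for each set is five times that of a single release at the same total budget, so more draws fall outside the bounds and the bounding step has a larger effect.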
We have introduced the concept of noninformative bounding in the sanitization of statistics with finite bounds and investigated its impacts on the statistical properties of sanitized results in the context of two modified Laplace mechanisms for bounded statistics (truncated and BIT), both theoretically and empirically via simulation studies. Both the noninformative truncated and BIT Laplace mechanisms produce biased sanitized results for their original observed values unless the noninformative global bounds are symmetric around the original results, which is a hard-to-satisfy condition in real life given that the original statistics vary from data set to data set while the global bounds are fixed. However, sanitized results can be consistent for model parameters if the scale parameter of the Laplace distribution in the truncated and BIT Laplace sanitizers approaches 0 as the sample size increases, and if the original statistics are consistent estimators for the parameters. We also provided an upper bound for the MSE between the sanitized and the original results for a definite in the truncated Laplace and the BIT Laplace mechanisms; the bound goes to 0 if the scale parameter goes to 0 as approaches .
Though the BIT Laplace mechanism in theory delivers less biased sanitized statistics than the truncated Laplace mechanism, the former does not seem to be more advantageous than the latter in practical applications, factoring in the following considerations. First, asymptotic unbiasedness and consistency hold under the same regularity conditions in both mechanisms and there is little difference between the two when is large. Second, the truncated Laplace distribution is a smooth distribution while the BIT Laplace distribution is discrete and comprises 3 pieces. Though the distributional shape might be irrelevant in the release of a single sanitized statistic, it will matter in some differentially private data release mechanisms such as the model-based differentially private synthesis (modips) approach (Liu, 2017). Last, the discrete 3-piece distributional shape of the BIT Laplace distribution requires the intervals of the outcomes to be closed on both ends so that the boundary values are exclusively defined. This is not necessary for the truncated Laplace distribution where the density function is continuous and smooth. This last point seems trivial but can be annoying in practical applications. For example, in the first simulation, closed intervals and were applied to variance and correlation, respectively. Some sanitized outputs were exactly 0 for variance, and exactly -1 or 1 for correlation from the BIT Laplace mechanism. In practice, these values are rare occurrences due to measurement errors and noise, and users may choose to reject the sanitized results exactly at the boundary values. If the users demand more plausible results that agree with real-life situations, the decision of what values to use to replace the implausible boundary values becomes arbitrary and could also potentially affect the statistical properties of the sanitized results. Those concerns do not exist in the truncated Laplace mechanism.
This paper has focused on the applications of the truncated and BIT bounding procedures in the framework of the Laplace mechanism of ε-DP. The bounding procedures are general enough to be extended to other differentially private sanitizers with unbounded numerical supports, and to the soft versions of DP, when sanitizing bounded statistics; the statistical properties of sanitized results from these extended applications will have to be examined case by case.
As briefly mentioned in the Introduction and examined in the second simulation study, satisfying both equality and inequality constraints is important in releasing differentially private results. The simulation study examined 3 different approaches to satisfying the equality constraints among a vector of proportions, and it turned out that the seemingly most ad-hoc procedure performed the best from a population inferential perspective. This is only a small empirical attempt at combining the equality and inequality constraints, but it raises interesting questions; future work is warranted to develop general, innovative and efficient mechanisms that deliver optimal constrained results under both inequality and equality constraints.
The online supplementary materials contain the calculations of the GS of some common statistics, including proportion, mean, variance, and covariance; as well as additional results from Simulation Study 2. The materials are available at https://www3.nd.edu/~fliu2/bounding-suppl.pdf.
Appendix A Proof of Proposition 5
The mean of a truncated Laplace distribution Lap is
The mean of the BIT Laplace distribution is . Since and , and given the result from Part a), then is
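For reference, the two biases can be written out explicitly under notation that is assumed here rather than taken from the paper: $s$ is the original statistic, $[c_0, c_1]$ are the global bounds with $c_0 \le s \le c_1$, $\lambda$ is the scale parameter of the Laplace distribution, $a = s - c_0$, $b = c_1 - s$, and $s^*$ denotes the sanitized result. The expressions below follow from direct integration of the Laplace density and are offered as a sketch; they may differ in form from the numbered equations in the original.

```latex
\begin{align*}
\mathrm{E}\left(s^*_{\mathrm{trunc}}\right) - s
  &= \frac{(a+\lambda)\,e^{-a/\lambda} - (b+\lambda)\,e^{-b/\lambda}}
          {2 - e^{-a/\lambda} - e^{-b/\lambda}}, \\[4pt]
\mathrm{E}\left(s^*_{\mathrm{BIT}}\right) - s
  &= \frac{\lambda}{2}\left(e^{-a/\lambda} - e^{-b/\lambda}\right).
\end{align*}
```

Both expressions vanish when $a = b$, that is, when the bounds are symmetric about $s$. Since $g(x) = (x+\lambda)e^{-x/\lambda}$ has derivative $g'(x) = -(x/\lambda)e^{-x/\lambda} \le 0$ for $x \ge 0$, the numerator of the first expression is zero only at $a = b$.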
Part a): In the case of , is unbiased for if . Let , where is a real number. is symmetric about . ; therefore, is a monotonic increasing function when and a monotonic decreasing function when . Taken together, and is unbiased for iff and are symmetric about . In the case of , is unbiased for if . is symmetric about Let , where is a real number. ; therefore, is a monotonic increasing function when and a monotonic decreasing function when . Taken together, and is unbiased for iff and are symmetric about .
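The unbiasedness condition in Part a) can also be checked numerically. The sketch below integrates the Laplace density with a midpoint rule to obtain the means of the truncated and BIT Laplace distributions. The function names, the bounds [0, 1], and the locations and scale used are hypothetical, and the original statistic is assumed to lie within the bounds.

```python
import math


def laplace_pdf(x, loc, scale):
    """Density of the Laplace distribution with location `loc`."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


def truncated_mean(loc, scale, lower, upper):
    """Mean of the Laplace distribution truncated to [lower, upper]."""
    mass = integrate(lambda x: laplace_pdf(x, loc, scale), lower, upper)
    first = integrate(lambda x: x * laplace_pdf(x, loc, scale), lower, upper)
    return first / mass


def bit_mean(loc, scale, lower, upper):
    """Mean of the BIT Laplace distribution: the probability mass outside
    the bounds is placed on the boundary it crossed (lower <= loc <= upper)."""
    p_low = 0.5 * math.exp(-(loc - lower) / scale)
    p_high = 0.5 * math.exp(-(upper - loc) / scale)
    inside = integrate(lambda x: x * laplace_pdf(x, loc, scale), lower, upper)
    return lower * p_low + upper * p_high + inside


# Symmetric bounds (location at the midpoint) versus asymmetric bounds.
sym_bias_trunc = truncated_mean(0.5, 1.0, 0.0, 1.0) - 0.5
sym_bias_bit = bit_mean(0.5, 1.0, 0.0, 1.0) - 0.5
asym_bias_trunc = truncated_mean(0.2, 1.0, 0.0, 1.0) - 0.2
asym_bias_bit = bit_mean(0.2, 1.0, 0.0, 1.0) - 0.2
```

In the symmetric case both biases are zero up to numerical error. In the asymmetric case both are positive (the mean is pulled toward the farther bound), with the BIT bias smaller than the truncated one, in line with Part b).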
Part b): When (both ). Since , then . In the case of , we have shown in Part c) that is symmetric and monotonically decreasing with ; therefore, and the numerator in Eq. (5) is . Since and , the denominator in Eq. (5) . Taken together, . When , we can prove and in a similar manner as when . To compare the magnitude of the vs , we compare the magnitude of bias:
Let and , then the last equation above is to compare v.s. 0. The first derivative