1 Introduction


In today’s data-driven world, protecting the privacy of individuals’ information is of the utmost importance to data curators, both as an ethical consideration and as a legal requirement. For example, the European Union’s Article 29 Data Protection Working Party describes privacy risks in terms of singling out, linkability and inference.

Sequential data, such as DNA sequences, textual data and mobility traces, is increasingly used in a variety of real-life applications, ranging from genome and language modeling to location-based recommendation systems. However, using such data poses considerable threats to individual privacy. A malicious adversary might use it to discover potentially sensitive information about a data owner, such as their habits, religion or relationships.

Data anonymisation is a popular means of privacy preservation in datasets. One example is the K-anonymity framework [15], [7], which anonymises data by generalising quasi-identifiers, ensuring that an individual’s data is indistinguishable from that of at least $K-1$ others. However, the K-anonymity approach still raises privacy concerns, since it is deterministic and susceptible to privacy attacks such as linkage attacks. It is therefore urgent to respond to the failure of existing anonymisation techniques by developing new schemes with provable privacy guarantees.

Differential privacy is one of the few frameworks that can provide such guarantees. Its main idea is to add noise to operations performed on a dataset so that an adversary cannot decide whether a particular user is included in the dataset. The inherent sequentiality and high dimensionality of sequential data make differential privacy challenging to apply. In particular, naively adding noise to the occurrence counts of each distinct sequence or n-gram in the dataset degrades utility [2]. Other approaches apply differential privacy to the set-union operation [6]: they collect a subset of items from each user, take the union of these subsets, and disclose the items whose noisy counts rise above a certain threshold. However, these mechanisms do not learn the counts or the probability distribution of the items, since doing so would require adding substantially more noise. One possible direction for mitigating the curse of dimensionality is to leverage Bayesian learning, using public data to shape the prior over the private data [3]. Learning a Bayesian model typically involves sampling from a posterior distribution, so the learning process is inherently randomized.

In this paper, we model the application of differential privacy as taking a sample from the posterior distribution in a Bayesian setup. We propose a practical mechanism that efficiently uses public data to improve the utility and reduce the privacy loss. The paper is organised as follows. Section 2 gives a gentle introduction to differential privacy, defining core concepts and mechanisms that achieve it. Section 3 walks through our proposed mechanism, and Section 4 discusses the experimental results of our mechanism in comparison with other mechanisms. Finally, Section 5 empirically evaluates the privacy loss of our mechanism by performing membership inference attacks. Proofs of the theorems, claims and corollaries are included in the Appendices.

2 Preliminaries

In this section, we review the definition of $(\epsilon, \delta)$-differential privacy and the exponential mechanisms used to achieve it.

2.1 Differential Privacy

Differential privacy is a statistical guarantee of privacy protection. It renders individuals’ information indistinguishable by adding noise [4], [5]. Let $\mathcal{V}$ be the set of attributes and $\mathcal{U}$ the set of users whose information we want to protect. Then $X$ is a database of counts in which each row corresponds to a user and $x_{u,w}$ is the count of attribute $w$ for user $u$, i.e. $X \in \mathbb{N}^{|\mathcal{U}| \times |\mathcal{V}|}$. We refer to the attribute set $\mathcal{V}$, which is independent of the users, as the vocabulary and to the attributes as words, but they can equally refer to n-grams. User adjacency is defined such that $X$ and $X'$ are two adjacent datasets, where $X'$ does not include user $u$.

Definition 1.

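(Differential Privacy) Let $\epsilon > 0$ and $\delta \in [0, 1)$ be given privacy parameters. We say a randomized mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-differential privacy if, for any adjacent datasets $X$ and $X'$ and all measurable sets $S$, we have:

$$\Pr[\mathcal{M}(X) \in S] \leq e^{\epsilon} \Pr[\mathcal{M}(X') \in S] + \delta$$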

To apply differential privacy to a function we need to know its sensitivity:

Definition 2.

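($\ell_p$-sensitivity) The $\ell_p$-sensitivity of a function $f$ is:

$$\Delta_p f = \max_{X, X'} \lVert f(X) - f(X') \rVert_p$$

where the maximum is taken over all pairs of adjacent datasets $X$ and $X'$.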

2.2 The Exponential Mechanism

We define two standard exponential mechanisms that achieve differential privacy, specifically the Laplace and Gaussian mechanisms [4], [5].

Definition 3.

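(Laplace Mechanism) For any function $f$, the mechanism:

$$\mathcal{M}(X) = f(X) + \mathrm{Lap}(\Delta_1 f / \epsilon)$$

gives $\epsilon$-differential privacy, where $\mathrm{Lap}(b)$ is a $b$-scale Laplace distribution and $\Delta_1 f$ is the $\ell_1$-sensitivity of $f$.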

Definition 4.

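(Gaussian Mechanism) For any function $f$, the mechanism:

$$\mathcal{M}(X) = f(X) + \mathcal{N}(0, \sigma^2 I)$$

gives $(\epsilon, \delta)$-differential privacy, for $\epsilon \in (0, 1)$ and $\sigma \geq \sqrt{2 \ln(1.25/\delta)}\, \Delta_2 f / \epsilon$.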
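As a rough illustration of the Gaussian mechanism, the Python sketch below calibrates the noise with the classical bound $\sigma \geq \sqrt{2 \ln(1.25/\delta)}\, \Delta_2 f / \epsilon$ from [4], which is assumed here to be the calibration intended by Definition 4. The function names and the `rng` argument are illustrative and do not come from the paper.

```python
import math
import random


def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise scale from the classical Gaussian-mechanism bound (needs 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the classical bound assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def gaussian_mechanism(values, l2_sensitivity, epsilon, delta, rng=None):
    """Add independent zero-mean Gaussian noise to every coordinate of `values`."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in values]
```

For example, `gaussian_mechanism([3.0, 1.0], l2_sensitivity=1.0, epsilon=0.5, delta=1e-5)` returns a noisy copy of the two values; a smaller `epsilon` or `delta` increases the noise scale.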

3 Bayesian Approach to Utilise Public Data

Let be the public counts, defined as a one row database, and the private database counts are defined over the public database vocabulary . To combine public and private counts in Bayesian differential privacy settings, we model the public data as the prior and the private data as the likelihood. Then, we compute the posterior and take a sample from it as the output [17]. In the simple case, the public counts can be modelled as a Dirichlet distribution and the private counts as a Multinomial distribution [9], [18]. However, to form a differential private mechanism we must truncate the Dirichlet distribution at the tails to prevent the sensitivity of the private counts from exploding [3], [18]. Instead, we can approximate the Dirichlet distribution of the public counts in the softmax space as a Gaussian distribution [12], [16] where is the standard deviation and:


which represents the prior distribution and the addition of 1 is important to avoid running into infinity. Using the private counts as a likelihood distribution in the Bayesian settings requires applying the softmax, i.e. the posterior distribution is . However, this is not a closed form posterior and so a differentially-private iterative sampling algorithm is required [13]. To mitigate this problem, we use a trick by modelling the overall private counts as one sample in the softmax/log space as follows, where is the total counts of word :


The adjacent database with user removed is defined as:


Consequently, the likelihood can be represented by a Gaussian distribution: , where does not depend on the private database. Since a Gaussian is a conjugate prior to itself, the posterior is also a Gaussian distribution: , where , and is a hyperparameter to be chosen. We then apply the softmax to get the normalised probability distribution:

Theorem 1.

Given the Gaussian mechanism := and its -sensitivity:
, then mechanism is ()-differentially private, where , and are defined by equations 1, 2 and 3 respectively.
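The posterior sampling step described before Theorem 1 can be sketched in Python as follows. This is a minimal sketch of the textbook Gaussian conjugate update for a single observation, assuming one standard deviation shared by all words for the prior and one for the likelihood. It does not reproduce the exact means, variances or hyperparameter of equations 1 to 3, and all names (`prior_mean`, `obs`, `prior_std`, `obs_std`) are hypothetical.

```python
import math
import random


def gaussian_posterior(prior_mean, prior_std, obs, obs_std):
    """Per-word posterior of a Gaussian mean after one Gaussian observation."""
    prior_prec = 1.0 / prior_std ** 2
    obs_prec = 1.0 / obs_std ** 2
    post_var = 1.0 / (prior_prec + obs_prec)
    # Posterior mean is the precision-weighted average of prior mean and observation.
    post_mean = [post_var * (prior_prec * m0 + obs_prec * x)
                 for m0, x in zip(prior_mean, obs)]
    return post_mean, math.sqrt(post_var)


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_distribution(post_mean, post_std, rng=None):
    """Draw one posterior sample in log space and normalise it with the softmax."""
    rng = rng or random.Random()
    draw = [rng.gauss(m, post_std) for m in post_mean]
    return softmax(draw)
```

In this sketch the public data would supply `prior_mean` and the private data `obs`, both in log space; the privacy guarantee depends on how the posterior standard deviation is calibrated against the sensitivity, which the sketch does not attempt.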

3.1 Word-wise Adaptive Clipping Strategy [14]

We examine the -sensitivity in Theorem 1:

We observe from the equation above that the sensitivity depends mostly on words with large user percentage contribution (). That means frequent words have little effect on the sensitivity as opposed to rare words, even though they contribute the most to the utility, e.g. KL divergence. To mitigate the effect of rare words, we define a decay function where is the number of users that have the word in their counts. A good decay function will penalise rare words more as fewer users contributed these words and so their sensitivity effect is large. The probability mass of these rare words can be taken from the public distribution. We experimented with clamped logarithmic, exponential and linear decay functions, and they all behaved similarly, therefore we decided to pick the linear function as it is the simplest and has a constant gradient:


Where is the maximum per-user word count and is a weighting hyperparameter.

Corollary 1.

Let then the
modified Gaussian mechanism := is ()-differentially private if where is the indicator function, is element-wise multiplication and and are defined in equation 5.

The -sensitivity can be computed by a brute-force method that is , where is the maximum number of words each user can contribute and is the number of users in the private database. The -sensitivity can also be bounded provided , and the private counts .

Claim 1.

Given , , and the number of users for each word , then the sensitivity defined in corollary 1 is bounded as follows:

Input :

, , , , , , ,

Result: normalised counts

for  to  do

        ; // clamp counts to C
end for
; // overall word counts
no_users_per_word;
; // equation 2
; // equation 5
if using the brute-force method to estimate sensitivity then
        ; // initial sensitivity
        for  to  do
        end for
else
        Compute worst-case estimate of the sensitivity (); // claim 1
end if
; // corollary 1
; // equation 1
Sample
Algorithm 1 BayesianDP

3.2 Differentially Private Hyperparameter Tuning

, and are hyperparameters and their values directly affect the ratio between the public and private counts and the variance of the posterior distribution for a given privacy parameters (). If we ignore the noise we can compare the mean output of after applying the softmax, i.e , to the normalised private counts distribution using KL-divergence. In other words, KL-divergence, for given private and public counts, is a function of , and . Consequently, there are optimal values at which the KL-divergence is minimised for any given public and private databases. is the weighting between private and public counts and and affect the amount of probability mass redistributed from frequent to rare words. A very small or very large transform the private counts to a uniform distribution. However, early experiments showed that has little effect on the KL-divergence unless the number of users is very large and/or there are rare words with high per-user counts, the latter is a clear privacy violation, so we do not vary and fix it to either 1 or 10 in all our experiments. Consequently, we only have two hyperparameters to vary and a grid-search hyperparameter tuning approach can be used at an extra cost of a small privacy budget (since we are only adding noise to scalar value) using a variant of the noisy max algorithm [11], [10].

The idea of private hyperparameter tuning is to split the private counts into two non-overlapping sets, evaluate the KL-divergence between the normalised counts () of the second set and the mean () of the Gaussian mechanism applied to the first set for different hyperparameters values (). Then, add Laplace noise with privacy loss to each KL-divergence score and find the hyperparameter values () that give the minimum noisy score. Finally, using the mean () at these values, we sample from the ()-DP Gaussian mechanism and report the final privacy parameters () using the strong composition theorem [4]. However, estimating the sensitivity of KL-divergence can be very complex and it is easier to use the un-normalised cross-entropy since they both exhibit their global minimum at the same hyperparameter values:

Theorem 2.

Given two non-overlapping sets of private counts, let be the counts of the first set and be evaluations of the mean of the Gaussian mechanism in Corollary 1, applied to the second set, for different hyperparameter values . Then for any -Lipschitz continuous scoring function: , the mechanism: , is -differential private, where is a -scale Laplace distribution and is the maximum of the scoring function sensitivities with respect to each set separately, i.e

As mentioned before, using un-normalised cross-entropy as a scoring function instead of KL-divergence simplifies the sensitivity analysis while achieving the same result.
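The selection step of the private hyperparameter tuning can be sketched as a noisy-minimum over the grid of scores. The Python sketch below is illustrative only: the Laplace scale is passed in directly because its calibration is given by Theorem 2 and is not reproduced here (for non-monotonic scores, a scale of twice the sensitivity divided by $\epsilon$ is a common conservative choice). The function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """The difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_argmin(scores, scale, rng=None):
    """Return the index of the smallest score after adding Laplace noise to each."""
    rng = rng or random.Random()
    noisy = [s + laplace_noise(scale, rng) for s in scores]
    return min(range(len(noisy)), key=noisy.__getitem__)
```

Here `scores` would hold one un-normalised cross-entropy value per hyperparameter setting, and only the returned index is released.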

Claim 2.

If the scoring function in Theorem 2 is defined by Equation 6, then sensitivity is bounded as follows:

where is the first private set and is their maximum word count.

The bounds in Claim 2 allow us to reuse the estimated sensitivity of the second private set when sampling from ()-differentially private Gaussian mechanism.

Input :

, , , , , , , ,

Result: normalised counts

Clamp the counts and compute // Extracted from algorithm 1

for  to  do

        BayesianDP(); // from equation 6 and from claim 2
end for
Algorithm 2 EndToEndDP

4 Experiments and Results

4.1 Dataset

For all experiments we used the Reddit data, which consists of 4.4M users, as the private dataset and the Google Billion Word [1] as the public dataset. We extracted trigrams and their counts from these datasets to form our database. Accordingly, the vocabulary is the total number of distinct trigrams, which is determined only by the public dataset, and the output distribution of our mechanism is the joint trigram distribution or .
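Building the per-user trigram count database can be sketched as below. This is a minimal Python sketch that assumes the text is already tokenised into lists of words and that the vocabulary is a set of trigrams taken from the public data; the function names and data layout are illustrative, not the authors' pipeline.

```python
from collections import Counter


def trigrams(tokens):
    """All consecutive word triples in a token list."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def user_trigram_counts(user_sentences, vocabulary):
    """Map each user to a Counter of their in-vocabulary trigram counts.

    user_sentences: dict of user id -> list of token lists.
    vocabulary: set of trigrams (tuples) derived from the public dataset.
    """
    database = {}
    for user, sentences in user_sentences.items():
        counts = Counter()
        for tokens in sentences:
            counts.update(t for t in trigrams(tokens) if t in vocabulary)
        database[user] = counts
    return database
```

Restricting the counts to the public vocabulary keeps the set of released trigrams independent of the private data.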

4.2 Baselines

We used two simple baselines in our experiments:

  1. The standard Laplace mechanism [4], [5], where we allow each user to contribute up to counts overall, and add Laplace noise with scale , then threshold at 0 and normalise to get a valid probability distribution.

  2. The other baseline is a modified Laplace mechanism that utilises public counts. We first normalise the public count, then for each user independently, we normalise the user counts and subtract the normalised public counts from it. After that, we get the weighted average of the subtracted output over all the users and add Laplace noise to it. Finally we add the normalised public counts, threshold at 0 and re-normalise. In other words, the un-normalised output of the modified Laplace mechanism:

    Where is the normalised public counts, and are the normalised counts and weighting of user and is the Laplace noise and is the -sensitivity of the modified mechanism.
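The first baseline can be sketched in a few lines of Python. The sketch assumes that the summed counts have already been limited so that each user contributes at most `max_user_total` counts overall, and it assumes the usual Laplace scale of `max_user_total / epsilon`; the scale used in the experiments is not reproduced here. All names are illustrative.

```python
import random


def laplace_noise(scale, rng):
    """The difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_baseline(total_counts, max_user_total, epsilon, rng=None):
    """Noisy counts, thresholded at 0 and normalised to a probability distribution."""
    rng = rng or random.Random()
    scale = max_user_total / epsilon  # assumed calibration (L1 sensitivity / epsilon)
    noisy = [c + laplace_noise(scale, rng) for c in total_counts]
    clipped = [max(0.0, v) for v in noisy]
    total = sum(clipped)
    if total == 0.0:
        # Every noisy count fell below zero: fall back to the uniform distribution.
        return [1.0 / len(clipped)] * len(clipped)
    return [v / total for v in clipped]
```

Thresholding and normalising are post-processing steps, so they do not affect the privacy guarantee of the noisy counts.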

4.3 Experimentation setup

For all experiments, the allowed word counts per user is set to 1 if the number of users is less than 1M, otherwise . We also limit the total number of counts for each user to . These values were chosen empirically from early experiments to be good utility-privacy trade-offs. For our Bayesian mechanism, we used the brute-force method to compute the sensitivity as it gives tighter bounds. In hyperparameter tuning, we split the private data into train set and validation set, where we spend on training and on private hyperparameter tuning, and .
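The two per-user limits can be illustrated with the Python sketch below. How the total budget is spent across words is an assumption made here for illustration (most frequent words first); the experiments may use a different rule, and the function name is hypothetical.

```python
def clamp_user_counts(counts, max_per_word, max_total):
    """Limit each word count to max_per_word and the user's total to max_total."""
    clamped = {}
    budget = max_total
    # Assumption: spend the total budget on the user's most frequent words first.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for word, count in ordered:
        if budget <= 0:
            break
        kept = min(count, max_per_word, budget)
        clamped[word] = kept
        budget -= kept
    return clamped
```

For instance, `clamp_user_counts({"a": 5, "b": 2, "c": 1}, max_per_word=2, max_total=3)` keeps two counts of `"a"` and one of `"b"`.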

4.4 Performance under different settings

We evaluate the utility of our mechanism, measured as the KL-divergence from the private data distribution, under different settings such as the number of users in private database, the size of vocabulary set and the quality of the public data. We also compare its utility with other baseline mechanisms, as shown in Figures 1 to 4.
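The utility metric can be computed as in the Python sketch below. It assumes the divergence is taken from the private distribution `p` to the mechanism output `q`, i.e. $\mathrm{KL}(p \,\|\, q)$, and it floors zero probabilities in `q` at a small constant to keep the value finite; both choices are assumptions of this sketch.

```python
import math


def kl_divergence(p, q, floor=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, floor))
               for pi, qi in zip(p, q) if pi > 0.0)
```

Words with zero probability under `p` contribute nothing to the sum, so only the words present in the private data affect the score.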

Figure 1: KL-divergence from the private distribution vs $\epsilon$ for different mechanisms. The number of users is 4M and the vocabulary size is 50K.

Figure 2: Left: KL-divergence from the private distribution vs the number of users for Bayesian differential privacy compared to a public baseline. Right: Laplace differential privacy with and without public data. and vocabulary size is 50K.
Figure 3: Left: KL-divergence from the private distribution vs the number of distinct trigrams (vocabulary size) for Bayesian differential privacy compared to a public baseline. Right: Laplace differential privacy with and without public data. The number of users is 4M and .
Figure 4: KL-divergence of Bayesian differential privacy vs the scale of the Laplace noise added to the public data to deteriorate its quality. The number of users is 4M, and .

4.5 Comparison with K-Anonymity

We compare our mechanism to K-anonymity, by evaluating the KL-divergence of the method at different values of K as shown in Figure 5. For , we achieve similar utility to K-anonymity with . However, differential privacy offers superior privacy guarantees.

Figure 5: Left: KL-divergence from the private distribution of K-anonymity vs K. Right: KL-divergence of Bayesian DP vs $\epsilon$. Both plots have the same scale on the y-axis (KL-divergence). The number of users is 100K and the vocabulary size is 50K.

4.6 Performance in Language Modeling

We compare the perplexity of our mechanism, applied to n-grams in language modeling, with that of the other baselines and K-anonymity. Since the performance of n-gram language models depends on the back-off strategy and on how out-of-vocabulary words are handled, we constrain the data to in-vocabulary words and to sentences covered by trigrams, so that no back-off strategy is required. This leads to a fair comparison, as shown in Table 1.

Mechanism                        Perplexity
Private baseline                 29.48
K-anonymity with K = 50          35.51
Bayesian DP at                   35.55
Laplace DP + public data at      580.46
Laplace DP at                    894.32
Public baseline                  43.00
Table 1: Average word perplexity for different mechanisms using trigram language models. The number of distinct n-grams is 60K.
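For reference, perplexity under a trigram model without back-off can be computed as in the Python sketch below. It assumes the mechanism outputs a joint trigram distribution, converts it to conditional probabilities, and evaluates only sentences whose trigrams are all covered; the function names are illustrative.

```python
import math
from collections import defaultdict


def conditional_from_joint(joint):
    """Turn joint trigram probabilities P(w1, w2, w3) into P(w3 | w1, w2)."""
    context_mass = defaultdict(float)
    for (w1, w2, _), p in joint.items():
        context_mass[(w1, w2)] += p
    return {tri: p / context_mass[tri[:2]] for tri, p in joint.items() if p > 0.0}


def perplexity(sentences, cond_prob):
    """exp of the average negative log-probability per predicted word."""
    log_sum, n_words = 0.0, 0
    for tokens in sentences:
        for tri in zip(tokens, tokens[1:], tokens[2:]):
            log_sum += math.log(cond_prob[tri])  # KeyError if a trigram is not covered
            n_words += 1
    return math.exp(-log_sum / n_words)
```

The sketch raises an error on uncovered trigrams rather than backing off, which mirrors the restriction to fully covered sentences; it also assumes at least one trigram is evaluated.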

5 Empirical Privacy Bounds By Membership Inference Attack

The reason why our Bayesian mechanism is outperforming the standard Laplace mechanism for the same is because our mechanism has tighter bounds on the privacy loss. This is due to the word-based clipping strategy that minimizes the sensitivity effect of rare words. Moreover, the brute-force method of estimating the sensitivity tied with the hyperparameter tuning allows us to find the optimal privacy vs utility trade-off. To confirm this, we apply a membership inference attack [8] to the output of our Bayesian mechanism and compare against the Laplace mechanism. To perform the attack, we apply the mechanism to two adjacent datasets, one of which does not have the most contributing, in terms of user counts. We then sample two probability distributions from each dataset and repeat the same process times. The probability of membership inference is then the number of times the KL-divergence, relative to the removed user distribution, of is less than that of divided by . Plots in Figure 6 show that the probability of membership inference of the Laplace mechanism grows slowly with , even at the probability is still less than , whereas the inference probability of the Bayesian mechanism approaches as soon as . Therefore, the Bayesian mechanism gives a more accurate estimate of the true privacy loss , since from the definition of differential privacy, a indicates that the ratio between the output probability of the adjacent datasets is , which is a clear privacy violation. In other words, the Laplace mechanism is excessively pessimistic about and it is possible to achieve much higher utility at a lower using the Bayesian mechanism.
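The attack procedure can be sketched in Python as follows. This is a simplified sketch under stated assumptions: `mechanism` is any callable that maps a dataset and a random generator to a probability distribution, one sample is drawn per dataset in each trial, and the divergence is taken from the removed user's distribution `target` to each sample. None of these names come from the paper.

```python
import math
import random


def kl_divergence(p, q, floor=1e-12):
    """KL(p || q) in nats, flooring q to keep the value finite."""
    return sum(pi * math.log(pi / max(qi, floor))
               for pi, qi in zip(p, q) if pi > 0.0)


def membership_inference_rate(mechanism, data_with, data_without, target,
                              trials, rng=None):
    """Fraction of trials in which the output on the dataset containing the
    user is closer (in KL) to the user's distribution than the output on the
    adjacent dataset without the user."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(trials):
        sample_with = mechanism(data_with, rng)
        sample_without = mechanism(data_without, rng)
        if kl_divergence(target, sample_with) < kl_divergence(target, sample_without):
            hits += 1
    return hits / trials
```

A rate near 0.5 means the two outputs are hard to tell apart, while a rate near 1 means the removed user is easily identified.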

Figure 6: Membership inference probability vs $\epsilon$ for Laplace DP and Bayesian DP. The number of users is 1.2M and the vocabulary size is 50K.

6 Conclusion

In this paper, we proposed a novel Bayesian differential privacy approach, applied to n-grams, which utilises public data to provide a significantly better utility vs privacy trade-off compared to well-known privacy mechanisms, such as the Laplace mechanism. In our approach, we transform the counts to log space, approximating the distribution of the public and private data as Gaussian. The posterior distribution is then evaluated and softmax is applied to produce a probability distribution. We performed several experiments on n-grams from the Reddit dataset to demonstrate the superior performance of our mechanism, achieving up to 85% reduction in KL-divergence compared to the Laplace mechanism for privacy loss , and similar KL-divergence performance to K-anonymity with . Finally, we applied a membership inference attack to explain the improvement of our mechanism over the Laplace mechanism. The attack showed that our mechanism provides tighter bounds on the privacy loss and thus gives better estimate of . Future work will investigate using a similar Bayesian approach to apply differential privacy to deep neural networks during Stochastic Gradient Descent (SGD) training.


The authors of this paper would like to acknowledge the SwiftKey Task and Intelligence Research team, in particular Joe Osborne and Dmitry Stratiychuk, for their insights and suggestions that guided the research. We would also like to acknowledge everyone that reviewed this paper and provided invaluable guidance and feedback.

Appendix A Proof of Theorem 1 and Corollary 1

We start by proving Theorem 1, where the proposed Gaussian mechanism can be decomposed into a deterministic function and additive Gaussian noise.
Define , where is the set of all private users of a database and is the vocabulary size. Adopting the same notation used in theorem 1, let be an adjacent database to with user removed and let be the public and user counts respectively. The adjacency, , is expressed formally as and . Consequently, , where:

satisfies a ()-DP Gaussian mechanism if the standard deviation , where

And rearranging this will give:

Corollary 1 follows from Theorem 1 if we replace with , where as defined by equation 5 and , since each user can only affect the number of users for any word by 1. As a result:

The identity function in the last equality stems from clamping the decay function at , i.e the gradient is zero for or . We note also that the first equality can be rearranged differently, which will come handy in the proving claim 1, as follows:


Appendix B Proof of Claim 1

Claim 1 provides a worst-case estimate of the sensitivity , that does not require recomputing sensitivity of for each user . Firstly, we state the following lemma:

Lemma 1.

The sensitivity of , which is defined in equation 2, is bounded from above as follows:

Then, using this lemma, we start the proof from Equation 7:

The last two inequalities follow from the monotonicity of and Lemma 1 respectively.
Now we prove Lemma 1, by defining a strictly increasing function similar to :

The function has two important properties:

  • It is a concave function, since it’s a strictly increasing linear combination of

From the first property we can conclude that is -Lipschitz over with respect to the if , since and are dual norms. Moreover:

The inequality follows because the smallest word count is equal to the number of users that have the word. Consequently, the maximum norm of the gradient: . From the Lipschitz property:

Appendix C Proof of Theorem 2

The first part of the theorem is a variant of the noisy max algorithm, with the relaxation that the score function does not have to be monotonic. To follow the same steps as the noisy max proof, we will prove the theorem for the maximum of the negative score function . This is allowed because of the symmetry of the Laplace noise.

Let and be the scores for the adjacent databases for and . For any , fix , a draw from used for all noisy scores except the kth score. We will argue for each independently.
Define the notation to mean the probability that the output of the mechanism (index of the maximum noisy score) is , conditioned on .
First, we show that . Define

Since we want the maximum value, will be the output when the database is if and only if . Let , we have for all :

Thus, if , then the kth score will be the maximum when the database is and the noise vector is . The probabilities below are over the choice of :

Rearranging and multiplying both sides by :

Now we show , following similar steps as above. Define

will be the output(argmax noisy score) when the database is if and only if we have for all :

Thus, if , then the kth score will be the maximum when the database is and the noise vector is . The probabilities below are over the choice of :

And so:

Finally, since:

which is -differentially private, it is trivial to show that is also -differentially private, because the Laplace distribution is symmetric around and the above proof only requires the sensitivity of and no other assumption is made on the properties of .

The second part of the theorem is a direct consequence of having non-overlapping private sets, i.e any user removed in can only be from one of the private sets and not the other, thus we can treat each set separately and take the maximum sensitivity of both.

Appendix D Proof of Claim 2

The sensitivity with respect to the first private set, , is a brute-force evaluation of adding and removing each user in the set. This is bounded for all users if you take the maximum word count in the set as shown in the claim. Therefore, here we prove the upper bound of the sensitivity with respect to the second private set .
Let so . We first note that score function defined in equation 6 is a convex function in since it is a scaled version of KL-divergence. Consequently, is -Lipschitz over with respect to the norm if , since is self-dual. Moreover, for all :

Thus is -Lipschitz with . The Lipschitz property implies that:


  1. During his time at Microsoft.


References

  1. C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn and T. Robinson (2013) One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.
  2. R. Chen, G. Acs and C. Castelluccia (2012) Differentially private sequential data publication via variable-length n-grams. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, pp. 638–649.
  3. C. Dimitrakakis, B. Nelson, Z. Zhang, A. Mitrokotsa and B. I. Rubinstein (2017) Differential privacy for Bayesian inference through posterior sampling. The Journal of Machine Learning Research 18 (1), pp. 343–381.
  4. C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9 (3–4), pp. 211–407. ISSN 1551-305X.
  5. C. Dwork (2006) Differential privacy. In Automata, Languages and Programming, Lecture Notes in Computer Science, vol. 4052, pp. 1–12.
  6. S. Gopi, P. Gulhane, J. Kulkarni, J. H. Shen, M. Shokouhi and S. Yekhanin (2020) Differentially private set union. arXiv preprint arXiv:2002.09745.
  7. N. Holohan, S. Antonatos, S. Braghin and P. Mac Aonghusa (2017) $(k, \epsilon)$-anonymity: k-anonymity with $\epsilon$-differential privacy. arXiv preprint arXiv:1710.01615.
  8. B. Jayaraman and D. Evans (2019) Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1895–1912.
  9. D. Klein (2008) Lecture notes in statistical NLP. Department of Electrical Engineering and Computer Sciences, UC Berkeley.
  10. K. Ligett, S. Neel, A. Roth, B. Waggoner and S. Z. Wu (2017) Accuracy first: selecting a differential privacy level for accuracy constrained ERM. In Advances in Neural Information Processing Systems, pp. 2566–2576.
  11. J. Liu and K. Talwar (2019) Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 298–309.
  12. D. J. MacKay (1998) Choice of basis for Laplace approximation. Machine Learning 33 (1), pp. 77–86.
  13. K. Minami, H. Arai, I. Sato and H. Nakagawa (2016) Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, pp. 956–964.
  14. V. Pichapati, A. T. Suresh, F. X. Yu, S. J. Reddi and S. Kumar (2019) AdaCliP: adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643.
  15. P. Samarati and L. Sweeney (1998) Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression.
  16. A. Srivastava and C. Sutton (2017) Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488.
  17. Y. Wang, S. Fienberg and A. Smola (2015) Privacy for free: posterior sampling and stochastic gradient Monte Carlo. In International Conference on Machine Learning, pp. 2493–2502.
  18. S. Zheng (2015) The differential privacy of Bayesian inference. Bachelor’s thesis, Harvard College.