1 Introduction
In today’s data-driven world, protecting the privacy of individuals’ information is of the utmost importance to data curators, both as an ethical consideration and as a legal requirement, e.g. the European Union’s Article 29 Data Protection Working Party describes the main privacy risks as singling out, linkability and inference.
Sequential data, such as DNA sequences, textual data and mobility traces, is increasingly used in a variety of real-life applications, spanning from genome and language modeling to location-based recommendation systems. However, using such data poses considerable threats to individual privacy. It might be used by a malicious adversary to infer sensitive information about a data owner, such as their habits, religion or relationships.
Data anonymisation is a popular means of privacy preservation in datasets. One such example is the K-anonymity framework [15], [7], which anonymises data by generalising quasi-identifiers, ensuring that an individual’s data is indistinguishable from at least K − 1 others’. However, even the K-anonymity approach still poses privacy concerns, since it is deterministic and susceptible to privacy attacks, such as linkage attacks. It is therefore urgent to respond to the failure of existing anonymisation techniques by developing new schemes with provable privacy guarantees.
Differential privacy is one of the few schemes that can provide such guarantees. The main idea of differential privacy is to add noise to operations performed on a dataset so that an adversary cannot decide whether a particular user is included in the dataset or not. Due to the inherent sequentiality and high dimensionality of sequential data, applying differential privacy to it is challenging. In particular, naively adding noise to the occurrence counts of each distinct sequence or n-gram in the dataset severely degrades the utility [2]. Other approaches apply differential privacy to the set-union operation [6] by collecting a subset of items from each user, taking the union of those subsets, and disclosing the items whose noisy counts rise above a certain threshold. However, these mechanisms do not learn the counts or the probability distribution of the items, which would otherwise require adding significantly more noise. One possible direction for mitigating the curse of dimensionality is to leverage Bayesian learning, utilising public data to shape our prior over the private data [3]. Learning a Bayesian model typically involves sampling from a posterior distribution, so the learning process is inherently randomised.
In this paper, we model applying differential privacy as taking a sample from the posterior distribution in a Bayesian setup. We propose a practical mechanism that efficiently uses public data to improve the utility and reduce the privacy loss. The paper is organised as follows. Section 2 gives a gentle introduction to differential privacy, defining core concepts and mechanisms that achieve differential privacy. Section 3 provides a walkthrough of our proposed mechanism, and Section 4 discusses the experimental results of our mechanism in comparison with other mechanisms. Finally, Section 5 empirically evaluates the privacy loss of our mechanism by performing membership inference attacks. Proofs of the theorems, claims and corollaries are included in the Appendices.
2 Preliminaries
In this section, we review the definition of differential privacy and the standard additive-noise mechanisms used to achieve it.
2.1 Differential Privacy
Differential privacy is a statistical guarantee of privacy protection. It renders individuals’ information indistinguishable by adding noise [4], [5]. Let V be the set of attributes and U the set of users whose information we want to protect; then D is a database of counts where each row corresponds to a user u ∈ U and D_{u,w} is user u’s count of attribute w ∈ V. We refer to the attribute set V, which is independent of the users, as the vocabulary and to the attributes as words, but they can equally refer to n-grams. User adjacency is defined such that D and D_{−u} are two adjacent datasets, where D_{−u} does not include user u.
Definition 1.
(Differential Privacy) Let ε > 0 and δ ∈ [0, 1) be given privacy parameters. We say a randomised mechanism M satisfies (ε, δ)-differential privacy if, for any adjacent datasets D and D′ and all measurable sets S of outputs, we have:

Pr[M(D) ∈ S] ≤ e^ε Pr[M(D′) ∈ S] + δ
To apply differential privacy to a function f we need to know its sensitivity:
Definition 2.
(ℓp-sensitivity) The ℓp-sensitivity of a function f is:

Δ_p f = max_{D, D′ adjacent} ‖f(D) − f(D′)‖_p
2.2 Additive Noise Mechanisms
We define two standard additive-noise mechanisms that achieve differential privacy, namely the Laplace and Gaussian mechanisms [4], [5].
Definition 3.
(Laplace Mechanism) For any function f : D → ℝ^d, the mechanism:

M(D) = f(D) + (Y₁, …, Y_d), with each Y_i drawn independently from Lap(Δ₁f / ε),

gives ε-differential privacy, where Lap(b) is a b-scale Laplace distribution and Δ₁f is the ℓ₁-sensitivity of f.
Definition 4.
(Gaussian Mechanism) For any function f : D → ℝ^d, the mechanism:

M(D) = f(D) + N(0, σ²I_d)

gives (ε, δ)-differential privacy for σ ≥ √(2 ln(1.25/δ)) Δ₂f / ε and ε ∈ (0, 1).
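The two mechanisms above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper’s implementation; the toy histogram, sensitivities and privacy parameters below are assumed values.

```python
import numpy as np

def laplace_mechanism(f_value, l1_sensitivity, epsilon, rng):
    # epsilon-DP: add Laplace noise with scale b = Delta_1 f / epsilon.
    scale = l1_sensitivity / epsilon
    return f_value + rng.laplace(0.0, scale, size=np.shape(f_value))

def gaussian_mechanism(f_value, l2_sensitivity, epsilon, delta, rng):
    # (epsilon, delta)-DP for epsilon in (0, 1), with
    # sigma = sqrt(2 ln(1.25/delta)) * Delta_2 f / epsilon.
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return f_value + rng.normal(0.0, sigma, size=np.shape(f_value))

rng = np.random.default_rng(0)
counts = np.array([120.0, 45.0, 3.0])  # toy word-count histogram
noisy_laplace = laplace_mechanism(counts, 1.0, 0.5, rng)
noisy_gauss = gaussian_mechanism(counts, 1.0, 0.5, 1e-5, rng)
```

Note that the sensitivity passed in must be computed for the concrete counting query; here one user is assumed to change each count by at most 1.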
3 Bayesian Approach to Utilise Public Data
Let c₀ be the public counts, defined as a one-row database, and let the private database counts be defined over the public database vocabulary V. To combine public and private counts in a Bayesian differential privacy setting, we model the public data as the prior and the private data as the likelihood. We then compute the posterior and take a sample from it as the output [17]. In the simple case, the public counts can be modelled as a Dirichlet distribution and the private counts as a Multinomial distribution [9], [18]. However, to form a differentially private mechanism, we must truncate the tails of the Dirichlet distribution to prevent the sensitivity of the private counts from exploding [3], [18]. Instead, we can approximate the Dirichlet distribution of the public counts in the softmax space as a Gaussian distribution N(μ₀, σ₀²I) [12], [16], where σ₀ is the standard deviation and:
(1)  μ₀,w = log(c₀,w + 1) − (1/|V|) Σ_{v∈V} log(c₀,v + 1)
which represents the prior distribution; the addition of 1 is important to avoid taking the logarithm of zero. Using the private counts as a likelihood distribution in the Bayesian setting requires applying the softmax, i.e. the posterior is over the softmax-space parameters. However, this posterior has no closed form, so a differentially private iterative sampling algorithm would be required [13]. To mitigate this problem, we use a trick: we model the overall private counts as one sample in the softmax/log space as follows, where c_w is the total count of word w:
(2)  μ_w = log(c_w + 1)
The adjacent database with user u removed is defined as:
(3)  μ_w^(−u) = log(c_w − D_{u,w} + 1)
Consequently, the likelihood can be represented by a Gaussian distribution N(μ, σ²I), where σ does not depend on the private database. Since a Gaussian is a conjugate prior to itself, the posterior is also a Gaussian distribution N(μ′, σ′²I), where σ′² = (1/σ₀² + 1/σ²)⁻¹ and μ′ = σ′²(μ₀/σ₀² + μ/σ²), and σ is a hyperparameter to be chosen. We then apply the softmax to get the normalised probability distribution:
(4)  p_w = exp(θ_w) / Σ_{v∈V} exp(θ_v),  θ ~ N(μ′, σ′²I)
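The pipeline described above (log-space prior from public counts, Gaussian-conjugate update with the aggregated private counts, one posterior sample pushed through the softmax) can be sketched as follows. The exact forms of the means and the hyperparameter values below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def bayesian_release(public_counts, private_counts, sigma0, sigma, rng):
    mu0 = np.log(public_counts + 1.0)  # prior mean from public counts (+1 avoids log 0)
    mu = np.log(private_counts + 1.0)  # likelihood mean from aggregated private counts
    # Gaussian is conjugate to itself: precision-weighted combination of the means.
    post_var = 1.0 / (1.0 / sigma0**2 + 1.0 / sigma**2)
    post_mean = post_var * (mu0 / sigma0**2 + mu / sigma**2)
    # The released output is a single posterior sample, normalised by the softmax.
    theta = rng.normal(post_mean, np.sqrt(post_var))
    return softmax(theta)

rng = np.random.default_rng(1)
p = bayesian_release(np.array([50.0, 30.0, 20.0]),   # toy public counts
                     np.array([200.0, 5.0, 5.0]),    # toy private counts
                     sigma0=1.0, sigma=1.0, rng=rng)
```

For a differentially private release, the posterior standard deviation must additionally satisfy the bound of Theorem 1.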
Theorem 1.
The mechanism M(D) := softmax(θ), θ ~ N(μ′, σ′²I), defined by equation 4, is (ε, δ)-differentially private if σ′ ≥ √(2 ln(1.25/δ)) Δ₂/ε, where Δ₂ = max_u ‖μ′ − μ′^(−u)‖₂ is the ℓ₂-sensitivity of the posterior mean.
3.1 Word-wise Adaptive Clipping Strategy [14]
We examine the sensitivity in Theorem 1:
We observe from the equation above that the sensitivity is dominated by words with a large per-user percentage contribution (D_{u,w}/c_w). That means frequent words have little effect on the sensitivity, as opposed to rare words, even though frequent words contribute the most to the utility, e.g. to the KL-divergence. To mitigate the effect of rare words, we define a decay function d(n_w), where n_w is the number of users that have word w in their counts. A good decay function penalises rare words more, as fewer users contributed these words and so their effect on the sensitivity is large. The probability mass of these rare words can instead be taken from the public distribution. We experimented with clamped logarithmic, exponential and linear decay functions; they all behaved similarly, so we picked the linear function as it is the simplest and has a constant gradient:
(5) 
where C is the maximum per-user word count and λ is a weighting hyperparameter.
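One possible instantiation of such a clamped linear decay can be sketched as follows; since equation 5 is not reproduced here, the exact formula below is an assumption for illustration.

```python
import numpy as np

def linear_decay(n_users, C, lam):
    # Hypothetical clamped linear decay: a word contributed by few users is
    # scaled down, reducing its effect on the sensitivity, while words seen
    # by many users saturate at weight 1. C plays the role of the maximum
    # per-user word count and lam of the weighting hyperparameter in eq. 5.
    return np.clip(lam * n_users / C, 0.0, 1.0)

n_users = np.array([1, 5, 50, 500])   # number of users contributing each word
weights = linear_decay(n_users, C=100, lam=2.0)
# rare words get small weights; frequent words are clamped at 1
```

Clamping is what produces the indicator function in Corollary 1: outside the clamped range the decay has zero gradient.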
Corollary 1.
Let μ̃ := d(n) ⊙ μ be the decayed private mean; then the modified Gaussian mechanism M := softmax(θ), θ ~ N(μ̃′, σ′²I), is (ε, δ)-differentially private if σ′ satisfies the bound of Theorem 1 with the correspondingly modified sensitivity, where 1{·} is the indicator function, ⊙ is element-wise multiplication, and C and λ are defined in equation 5.
The sensitivity can be computed by a brute-force method that is O(U·m), where m is the maximum number of words each user can contribute and U is the number of users in the private database. The sensitivity can also be bounded analytically, given C, λ and the private counts.
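The brute-force computation is a direct loop over users: evaluate the deterministic part of the mechanism on the full database and on each leave-one-out adjacent database, and keep the largest ℓ₂ gap. The log transform below is a stand-in for the mechanism's mean function and is an assumption.

```python
import numpy as np

def l2_sensitivity_bruteforce(user_counts, transform):
    # user_counts: (num_users, vocab_size) matrix of per-user word counts.
    total = user_counts.sum(axis=0)
    f_full = transform(total)
    worst = 0.0
    for u in range(user_counts.shape[0]):
        # Adjacent database: the same totals with user u's counts removed.
        f_minus = transform(total - user_counts[u])
        worst = max(worst, float(np.linalg.norm(f_full - f_minus)))
    return worst

log_mean = lambda c: np.log(c + 1.0)   # assumed stand-in for the mechanism's mean
D = np.array([[3.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 1.0]])
delta2 = l2_sensitivity_bruteforce(D, log_mean)
```

With U users each contributing at most m words, the loop touches O(U·m) non-zero entries, matching the cost stated above.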
Claim 1.
Given C, λ, and the number of users n_w for each word w, the sensitivity defined in Corollary 1 is bounded as follows:
3.2 Differentially Private Hyperparameter Tuning
σ, C and λ are hyperparameters, and their values directly affect the ratio between the public and private counts and the variance of the posterior distribution for given privacy parameters (ε, δ). If we ignore the noise, we can compare the mean output after applying the softmax, i.e. softmax(μ′), to the normalised private counts distribution using the KL-divergence. In other words, the KL-divergence, for given private and public counts, is a function of σ, C and λ. Consequently, there are optimal values at which the KL-divergence is minimised for any given public and private databases. σ is the weighting between private and public counts, and C and λ affect the amount of probability mass redistributed from frequent to rare words. A very small or very large λ transforms the private counts into a uniform distribution. However, early experiments showed that C has little effect on the KL-divergence unless the number of users is very large and/or there are rare words with high per-user counts; the latter is a clear privacy violation, so we do not vary C and fix it to either 1 or 10 in all our experiments. Consequently, we only have two hyperparameters to vary, and a grid-search hyperparameter tuning approach can be used at the extra cost of a small privacy budget (since we are only adding noise to a scalar value) using a variant of the noisy max algorithm [11], [10].
The idea of private hyperparameter tuning is to split the private counts into two non-overlapping sets and to evaluate the KL-divergence between the normalised counts of the second set and the mean of the Gaussian mechanism applied to the first set, for different hyperparameter values (σ, λ). Then, we add Laplace noise with a small privacy loss to each KL-divergence score and find the hyperparameter values that give the minimum noisy score. Finally, using the mean at these values, we sample from the (ε, δ)-DP Gaussian mechanism and report the final privacy parameters using the strong composition theorem [4]. However, estimating the sensitivity of the KL-divergence can be very complex, and it is easier to use the unnormalised cross-entropy, since both exhibit their global minimum at the same hyperparameter values:
(6)  s(c, q) = −Σ_w c_w log q_w
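The tuning loop just described can be sketched as a report-noisy-min over candidate hyperparameter settings. The Laplace scale of 2·Δ/ε follows the standard noisy-max analysis; the candidate distributions and the sensitivity value below are illustrative assumptions.

```python
import numpy as np

def cross_entropy(counts, q):
    # Unnormalised cross-entropy between raw validation counts and a
    # candidate output distribution q (equation 6).
    return -float(np.sum(counts * np.log(q)))

def noisy_min_select(val_counts, candidates, sensitivity, epsilon, rng):
    # Score each candidate, add Laplace(2 * sensitivity / epsilon) noise,
    # and report only the index of the smallest noisy score.
    scores = np.array([cross_entropy(val_counts, q) for q in candidates])
    noisy = scores + rng.laplace(0.0, 2.0 * sensitivity / epsilon, size=len(scores))
    return int(np.argmin(noisy))

rng = np.random.default_rng(2)
val_counts = np.array([80.0, 15.0, 5.0])     # counts of the validation split
candidates = [np.array([0.80, 0.15, 0.05]),  # mechanism means for two assumed
              np.array([0.34, 0.33, 0.33])]  # hyperparameter settings
best = noisy_min_select(val_counts, candidates, sensitivity=0.1, epsilon=1.0, rng=rng)
```

Only the winning index is released, which is why the privacy cost of tuning stays small regardless of the grid size.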
Theorem 2.
Given two non-overlapping sets of private counts, let c be the counts of the first set and q₁, …, q_n be evaluations of the mean of the Gaussian mechanism in Corollary 1, applied to the second set, for n different hyperparameter values. Then, for any Lipschitz continuous scoring function s, the mechanism M := argmin_k (s(c, q_k) + Lap(2Δs/ε)) is ε-differentially private, where Lap(b) is a b-scale Laplace distribution and Δs is the maximum of the scoring function’s sensitivities with respect to each set separately, i.e. Δs = max(Δ_c s, Δ_q s).
As mentioned before, using the unnormalised cross-entropy as a scoring function instead of the KL-divergence simplifies the sensitivity analysis while achieving the same result.
Claim 2.
The bounds in Claim 2 allow us to reuse the estimated sensitivity of the second private set when sampling from the (ε, δ)-differentially private Gaussian mechanism.
4 Experiments and Results
4.1 Dataset
For all experiments we used the Reddit dataset, which consists of 4.4M users, as the private dataset, and the Google Billion Word corpus [1] as the public dataset. We extracted trigrams and their counts from these datasets to form our databases. Accordingly, the vocabulary is the set of distinct trigrams, which is determined only by the public dataset, and the output distribution of our mechanism is the joint trigram distribution.
4.2 Baselines
We used two simple baselines in our experiments:
The first baseline is the standard Laplace mechanism, which adds Laplace noise directly to the aggregated private counts, followed by thresholding at 0 and normalisation.
The other baseline is a modified Laplace mechanism that utilises the public counts. We first normalise the public counts; then, for each user independently, we normalise the user’s counts and subtract the normalised public counts from them. After that, we take the weighted average of the subtracted output over all users and add Laplace noise to it. Finally, we add back the normalised public counts, threshold at 0 and renormalise. In other words, the unnormalised output of the modified Laplace mechanism is:

where p₀ is the normalised public counts, p_u and w_u are the normalised counts and weighting of user u, and the Laplace noise is scaled by the sensitivity of the modified mechanism.
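The modified Laplace baseline just described can be sketched as follows; the user weights and the sensitivity value are assumed inputs rather than values derived in the paper.

```python
import numpy as np

def modified_laplace(public_counts, user_counts, user_weights,
                     sensitivity, epsilon, rng):
    p_pub = public_counts / public_counts.sum()
    # Per user: normalise the counts and subtract the normalised public counts.
    diffs = np.array([u / u.sum() - p_pub for u in user_counts])
    # Weighted average over users, plus Laplace noise.
    avg = np.average(diffs, axis=0, weights=user_weights)
    noisy = avg + rng.laplace(0.0, sensitivity / epsilon, size=avg.shape)
    # Add the public distribution back, threshold at 0 and renormalise.
    out = np.clip(noisy + p_pub, 0.0, None)
    return out / out.sum()

rng = np.random.default_rng(3)
public = np.array([50.0, 30.0, 20.0])
users = np.array([[4.0, 1.0, 0.0],
                  [2.0, 2.0, 1.0]])
p = modified_laplace(public, users, user_weights=[1.0, 1.0],
                     sensitivity=0.1, epsilon=1.0, rng=rng)
```

Subtracting the public distribution first means the noise only has to hide each user's deviation from the public counts, not the counts themselves.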
4.3 Experimentation setup
For all experiments, the allowed word count per user is set to 1 if the number of users is less than 1M, and to a larger value otherwise. We also limit the total number of counts for each user. These values were chosen empirically from early experiments as good utility-privacy trade-offs. For our Bayesian mechanism, we used the brute-force method to compute the sensitivity, as it gives tighter bounds. In hyperparameter tuning, we split the private data into a train set and a validation set, spending part of the privacy budget on training and the remainder on private hyperparameter tuning.
4.4 Performance under different settings
We evaluate the utility of our mechanism, measured as the KL-divergence from the private data distribution, under different settings, such as the number of users in the private database, the size of the vocabulary and the quality of the public data. We also compare its utility with the baseline mechanisms, as shown in Figures 1 to 4.
4.5 Comparison with KAnonymity
We compare our mechanism to K-anonymity by evaluating the KL-divergence of each method at different values of K, as shown in Figure 5. At a comparable privacy loss, we achieve similar utility to K-anonymity with K = 50. However, differential privacy offers superior privacy guarantees.
4.6 Performance in Language Modeling
We compare the perplexity of our mechanism applied to n-grams in language modeling against the other baselines and K-anonymity. Since the performance of n-gram language models depends on the backoff strategies and on how out-of-vocabulary words are handled, we constrain the data to in-vocabulary words and to sentences covered by trigrams, so that no backoff strategy is required, leading to a fair comparison, as shown in Table 1.
Mechanism                   Perplexity
Private baseline            29.48
K-anonymity with K = 50     35.51
Bayesian DP                 35.55
Laplace DP + public data    580.46
Laplace DP                  894.32
Public baseline             43.00
5 Empirical Privacy Bounds By Membership Inference Attack
The reason our Bayesian mechanism outperforms the standard Laplace mechanism at the same ε is that our mechanism has tighter bounds on the privacy loss. This is due to the word-based clipping strategy, which minimises the sensitivity effect of rare words. Moreover, the brute-force method of estimating the sensitivity, together with the hyperparameter tuning, allows us to find the optimal privacy vs utility trade-off. To confirm this, we apply a membership inference attack [8] to the output of our Bayesian mechanism and compare it against the Laplace mechanism. To perform the attack, we apply the mechanism to two adjacent datasets, one of which does not include the most contributing user in terms of user counts. We then sample a probability distribution from each dataset and repeat the same process N times. The probability of membership inference is the number of times the KL-divergence, relative to the removed user’s distribution, of the first output is less than that of the second, divided by N. Plots in Figure 6 show that the probability of membership inference of the Laplace mechanism grows slowly with ε, remaining small even at large ε, whereas the inference probability of the Bayesian mechanism approaches 1 at a much smaller ε. Therefore, the Bayesian mechanism gives a more accurate estimate of the true privacy loss ε, since, from the definition of differential privacy, an inference probability approaching 1 indicates that the ratio between the output probabilities on the adjacent datasets is large, which is a clear privacy violation. In other words, the Laplace mechanism is excessively pessimistic about ε, and it is possible to achieve much higher utility at a lower ε using the Bayesian mechanism.
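The attack procedure above can be sketched as follows; the toy mechanism, user matrix and number of trials are illustrative stand-ins for the actual mechanisms being compared.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def toy_mechanism(user_counts, rng):
    # Stand-in DP mechanism: Laplace noise on the totals, then normalise.
    c = user_counts.sum(axis=0) + rng.laplace(0.0, 1.0, size=user_counts.shape[1])
    c = np.clip(c, 1e-6, None)
    return c / c.sum()

def inference_probability(mechanism, D, u_star, trials, rng):
    # Fraction of trials in which the output computed WITH user u_star is
    # closer (in KL, relative to that user's own distribution) than the
    # output computed WITHOUT the user.
    p_user = D[u_star] / D[u_star].sum()
    D_minus = np.delete(D, u_star, axis=0)
    hits = 0
    for _ in range(trials):
        q_in = mechanism(D, rng)
        q_out = mechanism(D_minus, rng)
        if kl(p_user, q_in) < kl(p_user, q_out):
            hits += 1
    return hits / trials

rng = np.random.default_rng(4)
D = np.array([[40.0, 1.0, 1.0],   # user 0 dominates the first word
              [2.0, 3.0, 2.0],
              [1.0, 2.0, 3.0]])
prob = inference_probability(toy_mechanism, D, u_star=0, trials=200, rng=rng)
```

An inference probability near 0.5 means the attacker cannot distinguish the adjacent datasets; a value near 1 indicates a clear privacy violation at that noise level.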
6 Conclusion
In this paper, we proposed a novel Bayesian differential privacy approach, applied to n-grams, which utilises public data to provide a significantly better utility vs privacy trade-off compared to well-known privacy mechanisms, such as the Laplace mechanism. In our approach, we transform the counts to log space, approximating the distributions of the public and private data as Gaussian. The posterior distribution is then evaluated and the softmax is applied to produce a probability distribution. We performed several experiments on n-grams from the Reddit dataset to demonstrate the superior performance of our mechanism, achieving up to an 85% reduction in KL-divergence compared to the Laplace mechanism at the same privacy loss, and similar KL-divergence performance to K-anonymity with K = 50. Finally, we applied a membership inference attack to explain the improvement of our mechanism over the Laplace mechanism. The attack showed that our mechanism provides tighter bounds on the privacy loss and thus gives a better estimate of ε. Future work will investigate using a similar Bayesian approach to apply differential privacy to deep neural networks during Stochastic Gradient Descent (SGD) training.
Acknowledgements
The authors of this paper would like to acknowledge the SwiftKey Task and Intelligence Research team, in particular Joe Osborne and Dmitry Stratiychuk, for their insights and suggestions that guided the research. We would also like to acknowledge everyone that reviewed this paper and provided invaluable guidance and feedback.
Appendix A Proof of Theorem 1 and Corollary 1
We start by proving Theorem 1, where the proposed Gaussian mechanism can be decomposed into a deterministic function of the database and additive Gaussian noise.
Define the deterministic part over the private databases, where U is the set of all private users of a database and |V| is the vocabulary size. Adopting the same notation as in Theorem 1, let D_{−u} be an adjacent database to D with user u removed, and let c₀ and D_u denote the public and user counts respectively. The adjacency is expressed formally through the corresponding count vectors. Consequently, the sensitivity can be written as follows, where:
satisfies an (ε, δ)-DP Gaussian mechanism if the standard deviation σ′ ≥ √(2 ln(1.25/δ)) Δ₂/ε, where
Rearranging this gives:
Corollary 1 follows from Theorem 1 if we replace μ with d(n) ⊙ μ, where d is the decay function defined by equation 5 and the number of users n_w changes by at most 1 between adjacent databases, since each user can only affect the number of users for any word by 1. As a result:
The indicator function in the last equality stems from clamping the decay function, i.e. the gradient is zero outside the clamped range. We note also that the first equality can be rearranged differently, which will come in handy when proving Claim 1, as follows:
(7) 
Appendix B Proof of Claim 1
Claim 1 provides a worst-case estimate of the sensitivity that does not require recomputing the sensitivity for each user u. Firstly, we state the following lemma:
Lemma 1.
The ℓ₂-sensitivity of μ, which is defined in equation 2, is bounded from above as follows:
Then, using this lemma, we start the proof from equation 7:
The last two inequalities follow from monotonicity and Lemma 1, respectively.
Now we prove Lemma 1 by defining a strictly increasing function g similar to μ:
The function g has two important properties:

It is a concave function, since it is a strictly increasing linear combination of concave functions.

From the first property we can conclude that g is Lipschitz over its domain with respect to the ℓ₁ norm if the ℓ∞ norm of its gradient is bounded, since the ℓ₁ and ℓ∞ norms are dual. Moreover:
The inequality follows because the smallest word count is equal to the number of users that have the word. Consequently, the maximum ℓ∞ norm of the gradient is bounded as above. From the Lipschitz property:
Appendix C Proof of Theorem 2
The first part of the theorem is a variant of the noisy max algorithm, with the relaxation that the score function does not have to be monotonic. To follow the same steps as the noisy max proof, we will prove the theorem for the maximum of the negative score function −s. This is allowed because of the symmetry of the Laplace noise.
Let s and s′ be the scores for the adjacent databases D and D′. For any k, fix ν_{−k}, a draw from the Laplace distribution used for all noisy scores except the k-th score. We will argue for each k independently.
Define the notation P[k | ν_{−k}] to mean the probability that the output of the mechanism (the index of the maximum noisy score) is k, conditioned on ν_{−k}.
First, we show the bound in one direction. Define
Since we want the maximum value, k will be the output when the database is D if and only if its noisy score exceeds all the others; we then have for all j ≠ k:
Thus, under this condition, the k-th score will be the maximum when the database is D and the noise vector is ν_{−k}. The probabilities below are over the choice of the k-th noise draw:
Rearranging and multiplying both sides by e^ε:
Now we show the reverse direction, following similar steps as above. Define
k will be the output (the argmax of the noisy scores) when the database is D′ if and only if, for all j ≠ k, we have:
Thus, under this condition, the k-th score will be the maximum when the database is D′ and the noise vector is ν_{−k}. The probabilities below are over the choice of the k-th noise draw:
And so:
Finally, since:
which is ε-differentially private, it is trivial to show that the argmin of the noisy scores is also ε-differentially private, because the Laplace distribution is symmetric around 0 and the above proof only requires the sensitivity of the score function; no other assumption is made on its properties.
The second part of the theorem is a direct consequence of having non-overlapping private sets: any user removed can only belong to one of the private sets and not the other, so we can treat each set separately and take the maximum of the two sensitivities.
Appendix D Proof of Claim 2
The sensitivity with respect to the first private set is a brute-force evaluation of adding and removing each user in the set. This is bounded for all users if we take the maximum word count in the set, as shown in the claim. Therefore, we prove here the upper bound on the sensitivity with respect to the second private set.
Let x denote the normalised counts. We first note that the score function defined in equation 6 is a convex function of x, since it is a scaled version of the KL-divergence. Consequently, it is Lipschitz over its domain with respect to the ℓ₂ norm if the ℓ₂ norm of its gradient is bounded, since the ℓ₂ norm is self-dual. Moreover, for all x:
Thus the score function is Lipschitz, and the Lipschitz property implies that:
Footnotes
 footnotetext: During his time at Microsoft
References
 (2013) One billion word benchmark for measuring progress in statistical language modeling. Technical report Google. External Links: Link Cited by: §4.1.
 (2012) Differentially private sequential data publication via variable-length n-grams. In Proceedings of the 2012 ACM conference on Computer and communications security, pp. 638–649. Cited by: §1.
 (2017) Differential privacy for bayesian inference through posterior sampling. The Journal of Machine Learning Research 18 (1), pp. 343–381. Cited by: §1, §3.
 (2014) The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9 (3–4), pp. 211–407. External Links: ISSN 1551-305X, Link, Document Cited by: §2.1, §2.2, §3.2, item 1.
 (2006) Differential privacy. In Automata, Languages and Programming, ser. Lecture Notes in Computer Science 4052, pp. 1–12. Cited by: §2.1, §2.2, item 1.
 (2020) Differentially private set union. arXiv preprint arXiv:2002.09745. Cited by: §1.
 (2017) (k, ε)-anonymity: k-anonymity with ε-differential privacy. arXiv preprint arXiv:1710.01615. Cited by: §1.
 (2019) Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1895–1912. Cited by: §5.
 (2008) Lecture notes in statistical nlp. Department of Electrical Engineering and Computer Sciences at UC Berkeley. Cited by: §3.
 (2017) Accuracy first: selecting a differential privacy level for accuracy constrained erm. In Advances in Neural Information Processing Systems, pp. 2566–2576. Cited by: §3.2.
 (2019) Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 298–309. Cited by: §3.2.
 (1998) Choice of basis for laplace approximation. Machine learning 33 (1), pp. 77–86. Cited by: §3.
 (2016) Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, pp. 956–964. Cited by: §3.
 (2019) AdaCliP: adaptive clipping for private sgd. arXiv preprint arXiv:1908.07643. Cited by: §3.1.
 (1998) Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Cited by: §1.
 (2017) Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488. Cited by: §3.
 (2015) Privacy for free: posterior sampling and stochastic gradient monte carlo. In International Conference on Machine Learning, pp. 2493–2502. Cited by: §3.
 (2015) The differential privacy of bayesian inference. Bachelor’s thesis, Harvard College. Cited by: §3.