Data-Dependent Differentially Private Parameter Learning for Directed Graphical Models
Abstract
Directed graphical models (DGMs) are a class of probabilistic models that are widely used for predictive analysis in sensitive domains, such as medical diagnostics. In this paper we present an algorithm for differentially private learning of the parameters of a DGM with a publicly known graph structure over fully observed data. Our solution optimizes for the utility of inference queries over the DGM and adds noise that is customized to the properties of the private input data set and the graph structure of the DGM. To the best of our knowledge, this is the first explicit data-dependent privacy budget allocation algorithm for DGMs. We compare our algorithm with a standard data-independent approach over a diverse suite of DGM benchmarks and demonstrate that our solution requires a smaller privacy budget to obtain the same or higher utility.
1 Introduction
Directed graphical models (DGMs) are widely used in causal reasoning and predictive analytics where prediction interpretability is desirable [32]. A typical use case of these models is answering "what-if" queries over domains that work with sensitive information. For example, DGMs are used in medical diagnosis to answer questions such as what is the most probable disease given a set of symptoms [43]. In learning such models, it is common that the model's graph structure is publicly known. For example, in the case of a medical data set, the dependencies between several physiological symptoms and diseases are well established and standardized. However, the parameters of the model have to be learned from observations that correspond to the random variables of the model. These observations may contain sensitive information, as in the case of medical applications; thus, learning and publicly releasing the parameters of the probabilistic model may lead to privacy violations [46, 59]. This establishes the need for privacy-preserving learning mechanisms for DGMs.
In this paper, we focus on the problem of privacy-preserving learning of the parameters of a DGM. For our privacy measure, we use differential privacy (DP) [17], the de facto standard for privacy. We consider the case when the structure of the target DGM is publicly known and the parameters of the model are learned from fully observed data. In this case, all parameters can be estimated via counting queries over the input observations (also referred to as the data set in the remainder of the paper). One way to ensure privacy is to add suitable noise to the observations using the standard Laplace mechanism [17]. Unfortunately, this method is data-independent, i.e., the noise added to the base observations is agnostic to the properties of the input data set and the structure of the DGM, and it degrades utility. To address this issue, we turn to data-dependent methods, i.e., methods in which the added noise is customized to the properties of the input data set [37, 62, 10, 2, 56, 55, 33].
We propose a data-dependent DP algorithm for learning the parameters of a DGM over fully observed data. Our goal is to minimize errors in arbitrary inference queries that are subsequently answered over the learned DGM. We outline our main contributions as follows:
(1) Explicit data-dependent privacy-budget allocation: Our algorithm computes the parameters of the conditional probability distribution of each random variable in the DGM via separate measurements from the input data set; this gives rise to an opportunity to optimize the privacy budget allocation across the different variables with the objective of reducing the error in inference queries. We formulate this objective in a data-dependent manner; the objective is informed by both the private input data set and the public graph structure of the DGM. To the best of our knowledge, this is the first paper to propose explicit data-dependent privacy-budget allocation for DGMs. We evaluate our algorithm on four benchmark DGMs. We demonstrate that our scheme requires a smaller privacy budget than a standard data-independent method to yield the same utility.
(2) New theoretical results: To preserve privacy we add noise to the parameters of the DGM. To understand how this noise propagates to inference queries, we provide two new theoretical results on the upper and lower bounds of the error of inference queries. The upper bound has an exponential dependency on the treewidth of the input DGM while the lower bound depends on its maximum degree. We also provide a formulation to compute the sensitivity [35] of the parameters associated with a node of a DGM, targeting the probability distributions of its child nodes only.
2 Background
We review basic background material for the problems and techniques discussed in this paper.
Directed Graphical Models: A directed graphical model (DGM), or Bayesian network, is a probabilistic model that is represented as a directed acyclic graph $\mathcal{G}$. The nodes of the graph represent random variables and the edges encode conditional dependencies between the variables. The graphical structure of the DGM represents a factorization of the joint probability distribution of these random variables. Specifically, given a DGM with graph $\mathcal{G}$, let $X_1, \ldots, X_n$ be the random variables corresponding to the nodes of $\mathcal{G}$ and let $X_{pa_i}$ denote the set of parents in $\mathcal{G}$ of the node corresponding to variable $X_i$. The joint probability distribution factorizes as $P[X_1, \ldots, X_n] = \prod_{i=1}^{n} P[X_i \mid X_{pa_i}]$, where each factor $P[X_i \mid X_{pa_i}]$ corresponds to a conditional probability distribution (CPD). For DGMs with discrete random variables, each CPD can be represented as a table of parameters $\Theta[x_i \mid x_{pa_i}]$, where each parameter corresponds to a conditional probability and $x_i$ and $x_{pa_i}$ denote the variable assignments $X_i = x_i$ and $X_{pa_i} = x_{pa_i}$.
A key task in DGMs is parameter learning. Given a DGM with known graph structure $\mathcal{G}$, the goal of parameter learning is to estimate each $\Theta[x_i \mid x_{pa_i}]$, a task solved via maximum likelihood estimation (MLE). In the presence of fully observed data $\mathcal{D}$ over the random variables of $\mathcal{G}$ (the attributes of the data set become the nodes of the DGM's graph; for the remainder of the paper we use the two terms interchangeably depending on the context), the maximum likelihood estimates of the CPD parameters take the closed form [32]:
$$\Theta[x_i \mid x_{pa_i}] = \frac{C[x_i, x_{pa_i}]}{C[x_{pa_i}]} \qquad (1)$$
where $C[x_i]$ denotes the number of records in $\mathcal{D}$ such that $X_i = x_i$.
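As an illustration of the closed-form estimate in Eq. (1), the following minimal sketch computes a CPD from fully observed records by counting. The toy records, the attribute names, and the helper `mle_cpd` are hypothetical and only serve to show the ratio of the joint count to the parent count.

```python
from collections import Counter

def mle_cpd(records, child, parents):
    """Maximum likelihood CPD of `child` given `parents`.

    records: list of dicts mapping attribute name -> value (fully observed).
    Returns {(child_value, parent_values): probability}.
    """
    joint = Counter()    # counts of (x_i, x_pa_i)
    parent = Counter()   # counts of x_pa_i
    for r in records:
        pa = tuple(r[p] for p in parents)
        joint[(r[child], pa)] += 1
        parent[pa] += 1
    # Ratio of the joint count to the parent count for every observed cell.
    return {key: n / parent[key[1]] for key, n in joint.items()}

# Hypothetical toy data set with two binary attributes.
records = [
    {"smoke": 1, "cancer": 1},
    {"smoke": 1, "cancer": 0},
    {"smoke": 1, "cancer": 1},
    {"smoke": 0, "cancer": 0},
]
cpd = mle_cpd(records, child="cancer", parents=["smoke"])
```

Cells that never occur in the data are simply absent from the returned dictionary; a real implementation would have to decide how to handle them.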
After learning, the fully specified DGM can be used to answer inference queries, i.e., queries that seek to compute the marginal or conditional probabilities of certain events (variables) of interest. Inference queries can also involve evidence, in which case the assignment of a subset of nodes is fixed. Inference queries are of three types: (1) marginal inference, (2) conditional inference, and (3) maximum a posteriori (MAP) inference. We refer the reader to Appendix 8.1 for more details.
For DGMs, inference queries can be answered exactly by the variable elimination (VE) algorithm [32], which is described in detail in Appendix 8.1.1. The basic idea is that we "eliminate" one variable at a time following a predefined order over the graph nodes. Let $\Phi$ denote a set of probability factors (initialized with all the CPDs of the DGM) and let $Z$ denote the variable to be eliminated. First, all probability factors involving $Z$ are removed from $\Phi$ and multiplied together to generate a new product factor. Next, $Z$ is summed out from this combined factor, generating a new factor that is entered into $\Phi$. VE thus corresponds to repeated sum-product computations of the form $\phi = \sum_{z} \prod_{i=1}^{t} \phi_i$, where $\phi_1, \ldots, \phi_t$ are the factors involving $Z$.
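The sum-product step can be made concrete with a small sketch. The chain A -> B -> C, its CPD values, and the helper `eliminate` below are hypothetical; the code only covers the special case where the factor being multiplied is one-dimensional, which is enough for a chain.

```python
# Hypothetical CPDs for a chain A -> B -> C with binary variables.
p_a = [0.6, 0.4]                          # P[A]
p_b_given_a = [[0.7, 0.3], [0.2, 0.8]]    # row a holds P[B | A=a]
p_c_given_b = [[0.9, 0.1], [0.5, 0.5]]    # row b holds P[C | B=b]

def eliminate(factor, cpd):
    """Multiply a 1-D factor over Z with a CPD P[Y | Z], then sum Z out."""
    n_out = len(cpd[0])
    return [
        sum(factor[z] * cpd[z][y] for z in range(len(factor)))
        for y in range(n_out)
    ]

# Eliminate A: the factors mentioning A are P[A] and P[B | A].
phi_b = eliminate(p_a, p_b_given_a)       # new factor over B
# Eliminate B: the factors mentioning B are phi_b and P[C | B].
p_c = eliminate(phi_b, p_c_given_b)       # marginal P[C]
```

Any noise in the CPD entries is carried through each of these products and sums, which is why errors in the parameters propagate to inference queries.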
Differential Privacy: We define differential privacy as follows:
Definition 2.1.
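An algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy ($\epsilon$-DP), where $\epsilon > 0$ is a privacy parameter, iff for any two data sets $\mathcal{D}$ and $\mathcal{D}'$ that differ in a single record and for every set of outputs $S \subseteq Range(\mathcal{A})$, we have
$$\Pr[\mathcal{A}(\mathcal{D}) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(\mathcal{D}') \in S] \qquad (2)$$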
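A result by [47, 31, 39] shows that the privacy guarantee of a differentially private algorithm can be amplified by a preceding sampling step. Let $\mathcal{A}$ be an $\epsilon$-DP algorithm and let $\mathcal{D}$ be an input data set. Let $\mathcal{A}'$ be an algorithm that runs $\mathcal{A}$ on a random subset of $\mathcal{D}$ obtained by sampling it with probability $\beta$.
Lemma 2.1.
Algorithm $\mathcal{A}'$ will satisfy $\epsilon'$-DP where $\epsilon' = \ln\left(1 + \beta(e^{\epsilon} - 1)\right)$.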
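To see how much a sampling step helps, the sketch below evaluates the standard amplification-by-subsampling bound, ln(1 + β(e^ε - 1)), which is assumed here to be the form intended by the lemma. The function names and the symbol β for the sampling probability are illustrative choices.

```python
import math

def amplified_epsilon(epsilon, sampling_rate):
    """Privacy parameter of an epsilon-DP algorithm run on a random subsample."""
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    return math.log(1.0 + sampling_rate * (math.exp(epsilon) - 1.0))

def base_epsilon_for_target(target_epsilon, sampling_rate):
    """Inverse of the bound: the budget the inner algorithm may spend
    so that the subsampled algorithm still meets target_epsilon."""
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    return math.log(1.0 + (math.exp(target_epsilon) - 1.0) / sampling_rate)

# With a 10% sample, an inner budget of 1.0 costs far less than 1.0 overall.
cost = amplified_epsilon(1.0, 0.1)
# Conversely, a small overall cost allows a larger inner budget on the sample.
inner = base_epsilon_for_target(0.1, 0.1)
```

The second helper reflects how the lemma is used later in the paper: a small budget charged to the first stage translates into a larger effective budget on the sampled data.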
3 DataDependent Differentially Private Parameter Learning for DGMs
We start with an overview of differentially private learning for DGMs over fully observed data and of our data-dependent algorithm. We then present our algorithm in detail.
3.1 Problem and Approach Overview
Let $\mathcal{D}$ be a sensitive data set of size $m$ with attributes $X_1, \ldots, X_n$ and let $\mathcal{N}$ be the DGM of interest. The graph structure $\mathcal{G}$ of $\mathcal{N}$, defined over the attribute set, is publicly known. Our goal is to learn the parameters $\Theta$, i.e., the CPDs of $\mathcal{N}$, in a data-dependent differentially private manner from $\mathcal{D}$. In addition to guaranteeing DP, we seek to optimize the way noise is introduced in the learned parameters such that the error in inference queries over the DP DGM is minimized. Errors are defined by comparing the inference results over the DP DGM to the inference results obtained over the MLE DGM with no noise injection.
We build upon two observations to solve the above problem:
(1) The parameters of each node $X_i$ of the DGM can be estimated separately via counting queries over the empirical marginal table (joint distribution $P[X_i, X_{pa_i}]$) of the attribute set $X_i \cup X_{pa_i}$.
(2) The factorization of the joint distribution decomposes the overall DP learning problem into a series of separate DP learning subproblems (one for each CPD). As a result, the total privacy budget has to be divided among these subproblems. However, owing to the structure of the graph and the data set, certain nodes will have more impact on an inference query than others. Hence, allocating more budget (and thereby getting better accuracy) to these nodes will result in reduced overall error for the inference queries. Thus, careful privacy budget allocation across the marginal tables can lead to better utility guarantees than a naive budgeting scheme that assigns equal budget to all marginal tables.
The above observations are key to our proposed algorithm, which computes the parameters of the DGM from their respective marginal tables with data-dependent noise injection to guarantee DP. Our method is outlined in Algorithm 2 and proceeds in two stages. In the first stage, we obtain preliminary noisy measurements of the parameters, which are used along with graph-specific properties (i.e., the height and out-degree of each node) to formulate a data-dependent optimization objective for privacy budget allocation. The solution of this objective is then used in the second stage to compute the final parameters. In summary, if $\epsilon^B$ is the total privacy budget available, we spend $\epsilon^I$ of it to obtain preliminary parameter measurements in Stage I, and the remaining $\epsilon^B - \epsilon^I$ is used for the final parameter computation in Stage II, after optimal allocation across the marginal tables.
Next we describe our algorithm in detail and highlight how we address two core technical challenges in our approach: (1) how to reduce the privacy budget cost of the first stage, or equivalently increase the budget left for the second stage (see Alg. 2, Lines 1-3), and (2) which properties of the input data set and graph the optimization objective should be based on (see Alg. 2, Lines 5-10).
3.2 Algorithm Description
We now describe the two stages of our core method in turn:
Stage I – Formulation of optimization objective: First, we handle the trade-off between the two parts of the total privacy budget, the Stage I budget $\epsilon^I$ and the Stage II budget $\epsilon^B - \epsilon^I$. While one wants to maximize the Stage II budget to reduce the amount of noise in the final parameters, sufficient budget is required to obtain good estimates of the statistics of the data set that are needed to form the data-dependent budget allocation objective. To handle this trade-off, we use the privacy amplification result of Lemma 2.1 to improve the accuracy of the optimization objective computation (Alg. 2, Lines 1-2). This allows us to assign a relatively low value to $\epsilon^I$ so that our budget for the final parameter computation is increased.
Next, we estimate preliminary parameters on the aforementioned sampled data set via the procedure ComputeParameters (described below and outlined in Algorithm 1) using the Stage I budget (Alg. 2, Lines 3-4). Note that these preliminary estimates are only required for formulating the optimization objective and are different from the final parameters (Alg. 2, Line 13). Hence, for the preliminary estimates we use a naive allocation policy that assigns an equal privacy budget to all tables when calling ComputeParameters (Alg. 2, Line 4).
Finally, Lines 5-12 of Algorithm 2 correspond to the estimation of the privacy budget optimization objective that depends on the data set and graph structure and is detailed in Section 3.3.
Stage II – Final parameter computation: We solve the optimization objective for the optimal privacy budget allocation and compute the final parameters using ComputeParameters (Alg. 2, Lines 13-15).
Procedure ComputeParameters: This procedure is outlined in Algorithm 1. Given a privacy budget allocation with one budget $\epsilon_i$ per node, its goal is to derive the parameters of the DGM in a differentially private manner. To this end, for each node $X_i$ the algorithm first materializes the count tables for the attribute sets $X_i \cup X_{pa_i}$ and $X_{pa_i}$ (Alg. 1, Line 2), and then injects Laplace noise into each of their cells, using half of the privacy budget $\epsilon_i$ for each table (Alg. 1, Line 3), to generate the corresponding noisy tables [17]. Next, it converts the noisy table for $X_i \cup X_{pa_i}$ to a marginal table, i.e., a joint distribution (Alg. 1, Line 4). This is followed by ensuring that all marginal tables are mutually consistent on all subsets of attributes using the method described in Appendix 8.2.1. Finally, the algorithm derives the noisy estimate of each parameter from the noisy tables (Alg. 1, Lines 7-10). Note that although the noisy parameters could have been derived from the table for $X_i \cup X_{pa_i}$ alone, we also use the noisy counts from the table for $X_{pa_i}$ to ensure independence of the added noise in Eq. (3).
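The core of this procedure for a single node can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: it assumes a Laplace scale of 2/ε for each cell (half of the node's budget per table, with counting-query sensitivity 1), it omits the mutual consistency step of Appendix 8.2.1, and it clips the resulting ratios to [0, 1] as an ad hoc safeguard. The names `laplace` and `noisy_cpd` and the toy data are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_cpd(records, child, parents, domains, epsilon, rng):
    """Noisy CPD of `child` given `parents` using budget `epsilon` for this node."""
    scale = 2.0 / epsilon  # assumption: epsilon / 2 per table, sensitivity 1
    joint, parent = Counter(), Counter()
    for r in records:
        pa = tuple(r[p] for p in parents)
        joint[(r[child], pa)] += 1
        parent[pa] += 1
    cpd = {}
    # Iterate over the full domain so that empty cells also receive noise.
    for pa in product(*(domains[p] for p in parents)):
        noisy_parent = parent[pa] + laplace(scale, rng)
        for x in domains[child]:
            noisy_joint = joint[(x, pa)] + laplace(scale, rng)
            ratio = noisy_joint / noisy_parent if noisy_parent > 0 else 0.0
            cpd[(x, pa)] = min(max(ratio, 0.0), 1.0)  # clip to a valid probability
    return cpd

# Hypothetical toy data: 100 records over two binary attributes.
records = (
    [{"smoke": 1, "cancer": 1}] * 30
    + [{"smoke": 1, "cancer": 0}] * 10
    + [{"smoke": 0, "cancer": 0}] * 60
)
domains = {"smoke": [0, 1], "cancer": [0, 1]}
cpd = noisy_cpd(records, "cancer", ["smoke"], domains, epsilon=1.0,
                rng=random.Random(0))
```

Because the joint and parent counts receive independent noise, the clipped ratios for a fixed parent assignment need not sum to one; that is one of the issues the consistency step in the paper is meant to address.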
3.3 Optimal Privacy Budget Allocation
Our goal is to find the optimal privacy budget allocation over the marginal tables, one per node $X_i$, $i \in [n]$, such that the error in the subsequent inference queries on the learned DGM is minimized.
Observation I: A more accurate estimate of the parameters of the DGM will result in better accuracy for the subsequent inference queries. Hence, we focus on reducing the total error of the parameters. From Eq. (1) and our Laplace noise injection (Alg. 1, Line 3), for a privacy budget of $\epsilon$, the value of a parameter of the DGM computed from the noisy marginal tables is expected to be
(3) 
From the rules of standard error propagation [20], the error in the noisy parameter is

(4) 
where $C[\cdot]$ denotes the number of records in the data set with the given assignment. Thus the mean error for the parameters of a node $X_i$ is the average of the per-parameter errors over the domain of the attribute set $X_i \cup X_{pa_i}$. Since using the true values of the counts would lead to a privacy violation, our algorithm uses the noisy estimates from the noisy tables (Alg. 2, Line 8).
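The error-propagation argument can be sketched as follows. This is a reconstruction under stated assumptions, not necessarily the exact form of Eqs. (3) and (4): each count is assumed to receive independent Laplace noise of scale $2/\epsilon$ (half of the budget per table, sensitivity 1), and the shorthand $a = C[x_i, x_{pa_i}]$, $b = C[x_{pa_i}]$, $\Theta = a/b$ is introduced only for this sketch.

```latex
\begin{align*}
\sigma &= \sqrt{2}\cdot\frac{2}{\epsilon} = \frac{2\sqrt{2}}{\epsilon}
  && \text{(standard deviation of } \mathrm{Lap}(2/\epsilon)\text{)} \\
\tilde{\Theta} &\approx \frac{a \pm \sigma}{b \pm \sigma}
  && \text{(noisy ratio of counts)} \\
\delta_{\tilde{\Theta}} &\approx \Theta \sqrt{\left(\frac{\sigma}{a}\right)^{2} + \left(\frac{\sigma}{b}\right)^{2}}
  = \Theta \cdot \frac{2\sqrt{2}}{\epsilon} \sqrt{\frac{1}{a^{2}} + \frac{1}{b^{2}}}
  && \text{(first-order propagation for a ratio)}
\end{align*}
```

Under these assumptions the error of every parameter scales as $1/\epsilon$, so spending more budget on a node shrinks its error proportionally, and cells with small counts contribute the largest errors.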
Observation II: Depending on the data set and the graph structure, different nodes will have different impact on the final inference queries. This information can be captured by a corresponding weighting coefficient for each node.
Let $\epsilon_i^*$ denote the optimal privacy budget for node $X_i$. From the above observations, we formulate our optimization objective as a weighted sum of the parameter errors as follows

(5)  

(6)  

(7) 
where $h_i$ is the height of the node $X_i$, $o_i$ is the out-degree of $X_i$, $\tilde{\Delta}_i$ is the sensitivity [35] of the parameters of $X_i$, $\delta_i$ gives the measure of the estimated mean error for the parameters (CPD) of $X_i$, and the denominator of the last term captures the linear constraint that the per-node budgets sum to the budget available for Stage II.
Computation of the weighting coefficient: For a given node, the weighting coefficient is computed from the following three node features:
(1) Height of the node: The height of a node is defined as the length of the longest path between the node and a leaf node. Due to the factorization of conditional probability distributions, the marginal probability distribution of a node depends only on the set of its ancestor nodes (as is explained in the following discussion on the computation of sensitivity). Thus, a node with a large height will affect the inference queries of more nodes (all its descendants) than, say, a leaf node.
(2) Out-degree of the node: A node causally affects all its child nodes. Thus, the impact on inference queries of a node with a high out-degree will be greater than that of, say, a leaf node.
(3) Sensitivity: The sensitivity of a parameter in a DGM measures the impact of small changes in the parameter value on a target probability. Laskey [35] proposed a method of computing the sensitivity by using the partial derivative of output probabilities with respect to the parameter being varied. However, previous work has mostly focused on the case where the target probability is the joint distribution of all the variables. In this paper we present a method to compute the sensitivity targeting the probability distributions of the child nodes only. For a node $X_i$, we consider the mean sensitivity of its parameters on the target probabilities of all of its child nodes. Formally,

(8) 
Observe that a node can affect another node only iff (its Markov blanket–Defn. 8.1). However due to the factorization of the joint distribution of all random variables of a DGM, can be expressed without . Thus just computing the mean sensitivity of the parameters over the set of child nodes turns out to be a good weighting metric for our setting. for leaf nodes is thus 0. Note that is distinct from the notion of sensitivity of a function used in Laplace mechanism (Appendix 8.1.2).
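The two purely graph-based features, height and out-degree, can be computed directly from the public graph. The sketch below assumes the DAG is given as a mapping from each node to its list of children; the function name `graph_features` and the example graph are hypothetical.

```python
from functools import lru_cache

def graph_features(children):
    """Return {node: (height, out_degree)} for a DAG given as {node: [child, ...]}."""
    @lru_cache(maxsize=None)
    def height(node):
        # Length of the longest directed path from `node` to a leaf.
        kids = children.get(node, [])
        return 0 if not kids else 1 + max(height(k) for k in kids)

    nodes = set(children) | {k for kids in children.values() for k in kids}
    return {v: (height(v), len(children.get(v, []))) for v in nodes}

# Hypothetical DAG: A -> B, A -> C, B -> D, C -> D.
features = graph_features({"A": ["B", "C"], "B": ["D"], "C": ["D"]})
```

Both features are zero for leaf nodes, which is consistent with the text's remark that an additive term is needed to keep the weighting coefficients of leaf nodes nonzero.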
Computing Sensitivity : Let and denote the set of all nodes such that there is a directed path from to . Basically denotes the set of ancestors of in . From the factorization of it is easy to see that from the conditional probability distributions of of the nodes in it is sufficient to compute as follows
Therefore, using our noisy preliminary parameter estimates (Alg 2, Line 4), we compute

(9) 
where ensures that only the product terms involving parameters with attributes in that match up with the corresponding values in are retained in the computation (as all others terms have partial derivative 0). Thus the noisy mean sensitivity estimate for the parameters of node , can be computed from Eq. (8) and (9) (Alg. 2, Line 9).
For example, for the DGM given by Figure 2 for node (assuming binary attributes for simplicity) we need to compute the sensitivity of its parameters on the target probability of , i.e., and which is computed as . The rest of the partial derivatives are computed in a similar manner to give us .
Thus the weighting coefficient is defined as the product of the aforementioned three features and given by Eq. (7). The extra additive term is used to handle leaf nodes so that the weighting coefficients are nonzero. Assuming , has a closed form solution as follows
(10) 
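A closed-form allocation of this kind can be illustrated with a short sketch. It assumes the objective has the form of a sum of per-node terms c_i / ε_i subject to the budgets summing to a fixed total, where c_i stands for the product of a node's weighting coefficient and its estimated mean error; this specific form, and the names `allocate_budget` and `objective`, are assumptions made for illustration. Under that assumption, a Lagrange multiplier argument gives budgets proportional to the square roots of the c_i.

```python
import math

def allocate_budget(costs, total_budget):
    """Minimise sum_i costs[i] / eps[i] subject to sum_i eps[i] = total_budget."""
    if any(c <= 0 for c in costs):
        raise ValueError("all costs must be positive")
    roots = [math.sqrt(c) for c in costs]
    norm = sum(roots)
    # Stationarity of the Lagrangian gives eps_i proportional to sqrt(cost_i).
    return [total_budget * r / norm for r in roots]

def objective(costs, eps):
    """Weighted sum of errors when each error scales as 1 / eps."""
    return sum(c / e for c, e in zip(costs, eps))

costs = [9.0, 4.0, 1.0]            # hypothetical per-node weight * error values
eps = allocate_budget(costs, total_budget=1.2)
uniform = [0.4, 0.4, 0.4]          # naive equal split of the same total
gain = objective(costs, uniform) - objective(costs, eps)
```

Nodes with a larger weighted error receive more budget, but only in proportion to the square root of that quantity, so no node is starved entirely.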
Discussion: Note that there are two kinds of information to be considered for a DGM: (1) the graph structure and (2) the data set. The height and the out-degree are purely graph characteristics and they summarize the graphical properties of a node. The sensitivity captures the interactions of the graph structure with the actual parameter values, thereby encoding the data set dependent information. Hence we theorize that the aforementioned three features are sufficient for constructing the weighting coefficients.
3.4 Privacy Analysis
Theorem 3.1.
The proposed algorithm (Algorithm 2) for learning the parameters of a directed graphical model over fully observed data is differentially private.
Proof.
Let $\epsilon^B$ be the total privacy budget and $\epsilon^I$ the part of it spent in Stage I. The sensitivity of counting queries is 1. Hence, the computation of the noisy tables (Alg. 1, Lines 2-3) is a straightforward application of the Laplace mechanism (Section 8.1.2). This, together with Lemma 2.1, makes the computation of the preliminary parameters $\epsilon^I$-DP. The subsequent computation of the optimal privacy budget allocation is a post-processing operation on the preliminary parameters and hence, by Theorem 8.4, is still $\epsilon^I$-DP. The final parameter computation is clearly $(\epsilon^B - \epsilon^I)$-DP. Thus, by the theorem of sequential composition (Theorem 8.3), Algorithm 2 is $\epsilon^B$-DP. The DGM thus learned can be released publicly, and any inference query run on it will still be $\epsilon^B$-DP (from Theorem 8.4). ∎
4 Error Analysis for Inference Queries
As discussed in Section 3, our optimization objective minimizes a weighted sum of the parameter errors. To understand how the error propagates from the parameters to the inference queries, we present two general results bounding the error of a sum-product term of the VE algorithm, given the errors in the factors.
Theorem 4.1.
[Lower Bound] For a DGM , for any sumproduct term of the form in the VE algorithm, we have
(11) 
where is the attribute being eliminated, is the set of attributes in , .
Theorem 4.2.
[Upper Bound] For a DGM , for any sumproduct term of the form , in the VE algorithm with the optimal elimination order, we have
where is the attribute being eliminated, is the treewidth of , is the maximum domain size of an attribute, is the set of attributes in , , and
For proving the lower bound, we introduce a specific instance of the DGM based on Lemma 8.1. For the upper bound, with the optimal elimination order of the VE algorithm, the maximum error has an exponential dependency on the treewidth. This is intuitive, as the complexity of the VE algorithm itself has the same dependency on the treewidth. The answer to a marginal inference query is the factor generated from the last sum-product term. Also, since the initial set of factors for the first sum-product term computation consists of the actual parameters of the DGM, all the errors in the subsequent intermediate factors, and hence in the inference query itself, can be bounded by functions of the parameter errors using the above theorems.
5 Evaluation
We now evaluate the utility of the DGM learned under differential privacy using our algorithm. Specifically, we study the following three questions: (1) Does our scheme lead to a low-error approximation of the DGM parameters? (2) Does our scheme result in low-error inference query responses? (3) How does our scheme fare against data-independent approaches?
Evaluation Highlights: First, focusing on the parameters of the DGM, we find that our scheme achieves low L1 error and low KL divergence across all test data sets. Second, we find that for marginal and conditional inferences, our scheme provides low L1 error and KL divergence for all test data sets. Our scheme also provides high accuracy for MAP queries, averaged over all test data sets. Finally, our scheme achieves strictly better utility than the data-independent baseline: it requires a smaller privacy budget to yield the same utility.
5.1 Experimental Setup
Data sets: We evaluate our proposed scheme on four benchmark DGMs [8] –
(1) Asia: Number of nodes – 8; Number of arcs – 8; Number of parameters – 18
(2) Sachs: Number of nodes – 11; Number of arcs – 17; Number of parameters – 178
(3) Child: Number of nodes – 20; Number of arcs – 25; Number of parameters – 230
(4) Alarm: Number of nodes – 37; Number of arcs – 46; Number of parameters – 509
For all four DGMs, the evaluation is carried out on corresponding synthetic data sets [12, 13] with 10,000 records each.
Metrics: For conditional and marginal inference queries we compute the following two metrics: the L1 error and the KL divergence $D_{KL}(P \,\|\, \tilde{P})$, where $P$ denotes either a true CPD of the DGM (parameter) or a true marginal/conditional inference query response and $\tilde{P}$ is the corresponding noisy estimate obtained from our proposed scheme. For answering MAP inferences, we compute the accuracy.
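For reference, the two error metrics can be computed as in the sketch below. It assumes both distributions are given as lists over the same support, takes the L1 error to be the sum of absolute differences (it could equally be defined as a mean), and floors the noisy probabilities to avoid division by zero; these choices and the function names are assumptions, not details taken from the paper.

```python
import math

def l1_error(p, q):
    """Sum of absolute differences between two distributions on the same support."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q, floor=1e-12):
    """KL(p || q) with the natural logarithm; entries of q are floored at `floor`."""
    return sum(a * math.log(a / max(b, floor)) for a, b in zip(p, q) if a > 0)

true_dist = [0.5, 0.5]       # hypothetical true query response
noisy_dist = [0.4, 0.6]      # hypothetical noisy estimate
err_l1 = l1_error(true_dist, noisy_dist)
err_kl = kl_divergence(true_dist, noisy_dist)
```

Terms with a true probability of zero are skipped in the KL sum, following the usual convention that 0 · log 0 = 0.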
Setup: We evaluate each data set on 20 random inference queries (10 marginal inferences, 10 conditional inferences) and report the mean error over 10 runs. For MAP queries, we run 20 random queries and report the mean result over 10 runs. The attribute subsets involved in the queries are varied from singletons up to the full attribute set. We compare our results with a standard data-independent baseline (denoted by DInd) [63, 61], which corresponds to executing Algorithm 1 on the entire input data set with a uniform privacy budget allocation. All the experiments have been implemented in Python and we set .
5.2 Experimental Results
Figure 1 shows the mean L1 error and KL divergence for the noisy parameters and for marginal and conditional inferences on the data sets Sachs and Child. The main observation is that our scheme achieves strictly lower error than DInd; specifically, our scheme requires a smaller privacy budget than DInd to yield the same utility. In most practical scenarios, the value of the privacy budget is kept small [29]. In Table 1, we present our experimental results for MAP queries. We see that our scheme achieves higher accuracy. For example, in the first row of Table 1, our scheme provides an accuracy of at least 0.86 on every data set, while the accuracy of DInd ranges from 0.79 to 0.89. Finally, given a marginal inference query, we compute a scale-normalized error that positions the error of the query between the lower and upper bounds computed using Theorem 4.1 and Theorem 4.2, respectively (both bounds are computed separately for each run of the experiment from their respective empirical parameter errors). The lower this value, the closer the error is to the lower bound, and vice versa. We report its mean value for 20 random inference queries (marginal and conditional) in Table 2. We observe that the errors are closer to their respective lower bounds. This is more prominent for the errors obtained from our data-dependent scheme than for those of DInd.
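Table 1: Accuracy for MAP queries.

| ε | Asia: DInd Scheme | Asia: Our Scheme | Sachs: DInd Scheme | Sachs: Our Scheme | Child: DInd Scheme | Child: Our Scheme | Alarm: DInd Scheme | Alarm: Our Scheme |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.88 | 1 | 0.81 | 0.86 | 0.79 | 0.93 | 0.89 | 0.95 |
| 1.5 | 0.93 | 1 | 0.87 | 0.93 | 0.83 | 0.95 | 0.92 | 0.98 |
| 2 | 1 | 1 | 0.92 | 0.98 | 0.89 | 0.97 | 0.95 | 1 |
| 2.5 | 1 | 1 | 0.96 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |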
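Table 2: Mean scale-normalized error for marginal and conditional inference queries.

| Scheme | Asia | Sachs | Child | Alarm |
|---|---|---|---|---|
| DInd Scheme | 0.008 | 0.065 | 0.0035 | 0.0046 |
| Our Scheme | 0.0035 | 0.04 | 0.0014 | 0.0012 |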
Thus we conclude that the non-uniform budget allocation in our data-dependent scheme gives better utility than a uniform budget allocation. For example, for the DGM Sachs, our scheme assigns the highest budget to node "PKC", which is the root with 5 child nodes, and the least budget to "Jnk", which is a leaf node.
6 Related Work
In this section we review related literature. There has been a steadily growing amount of work on differentially private machine learning models over the last couple of years. We list some of the most recent work in this line (the list is not exhaustive). [1, 53, 3] address the problem of differentially private SGD. The authors of [41] present an algorithm for differentially private expectation maximization. In [36] the problem of differentially private M-estimators is addressed. Algorithms for performing expected risk minimization under differential privacy have been proposed in [49, 9]. In [50] two differentially private subspace clustering algorithms are proposed.
There has been a fair amount of work on differentially private Bayesian inference and related notions [15, 51, 21, 63, 22, 7, 28, 60, 45, 40, 30, 5, 6, 19, 61]. In [28] the authors present a solution for DP Bayesian learning in a distributed setting, where each party holds only a single sample or a few samples of the data. In [19] the authors show that a data-dependent prior learnt under DP yields a valid PAC-Bayes bound. The authors of [52] show that probabilistic inference over differentially private measurements to derive posterior distributions over the data sets and model parameters can potentially improve accuracy. An algorithm to learn an unknown probability distribution over a discrete population from random samples under DP is presented in [14]. In [7] the authors present a method for private Bayesian inference in exponential families that learns from sufficient statistics. The authors of [51] and [15] show that posterior sampling gives differential privacy "for free" under certain assumptions. In [21] the authors show that a Laplace mechanism based alternative to "One Posterior Sample" is as asymptotically efficient as non-private posterior inference, under general assumptions. A Rényi differentially private posterior sampling algorithm is presented in [22]. [60] proposes a differentially private Naive Bayes classification algorithm for data streams. [63] presents algorithms for private Bayesian inference on probabilistic graphical models. In [40], the authors introduce a general privacy-preserving framework for Variational Bayes. An expressive framework for writing and verifying differentially private Bayesian machine learning algorithms is presented in [5]. The problem of learning discrete, undirected graphical models in a differentially private way is studied in [6]. [45] presents a general method for privacy-preserving Bayesian inference in Poisson factorization.
In [63] the authors propose algorithms for private Bayesian inference on graphical models. However, their proposed solution does not add data-dependent noise. In fact, their proposed algorithms (Algorithm 1 and Algorithm 2 in [63]) are essentially the same in spirit as our baseline solution DInd. Moreover, some proposals from [63] can be combined with DInd; for example, to ensure mutual consistency, [63] adds Laplace noise in the Fourier domain while DInd uses the techniques of [26]. DInd is also identical, apart from an additional consistency step, to an algorithm used in [61], which uses DGMs to generate high-dimensional data.
A number of data-dependent differentially private algorithms have been proposed in the past few years. [2, 56, 62, 55] outline data-dependent mechanisms for publishing histograms. In [10] the authors construct an estimate of the data set by building differentially private kd-trees. MWEM [25] derives an estimate of the data set iteratively via multiplicative weight updates. In [37] differential privacy is achieved by adding data- and workload-dependent noise. [34] presents a data-dependent differentially private algorithm selection technique. [24, 18] present two general data-dependent differentially private mechanisms. Certain data-independent mechanisms attempt to find a better set of measurements in support of a given workload. One of the most prominent techniques is the matrix mechanism framework [58, 38], which formalizes the measurement selection problem as a rank-constrained SDP. Another popular approach is to employ a hierarchical strategy [27, 11, 54]. [57, 4, 16, 23, 48, 26] propose techniques for marginal table release.
7 Conclusion
In this paper we have proposed an algorithm for differentially private learning of the parameters of a DGM with a publicly known graph structure over fully observed data. The noise added is customized to the private input data set as well as the public graph structure of the DGM. To the best of our knowledge, we propose the first explicit data-dependent privacy budget allocation mechanism for DGMs. Our solution achieves strictly higher utility than a standard data-independent approach; it requires a smaller privacy budget to achieve the same or higher utility.
References
 [1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 308–318, New York, NY, USA, 2016. ACM.
 [2] G. Acs, C. Castelluccia, and R. Chen. Differentially private histogram publishing through lossy compression. In 2012 IEEE 12th International Conference on Data Mining, pages 1–10, Dec 2012.
 [3] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7564–7575. Curran Associates, Inc., 2018.
 [4] Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’07, pages 273–282, New York, NY, USA, 2007. ACM.
 [5] Gilles Barthe, Gian Pietro Farina, Marco Gaboardi, Emilio Jesus Gallego Arias, Andy Gordon, Justin Hsu, and Pierre-Yves Strub. Differentially private Bayesian programming. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 68–79, New York, NY, USA, 2016. ACM.
 [6] Garrett Bernstein, Ryan McKenna, Tao Sun, Daniel Sheldon, Michael Hay, and Gerome Miklau. Differentially private learning of undirected graphical models using collective graphical models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 478–487. JMLR.org, 2017.
 [7] Garrett Bernstein and Daniel R Sheldon. Differentially private Bayesian inference for exponential families. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2919–2929. Curran Associates, Inc., 2018.
 [8] http://www.bnlearn.com/bnrepository/.
 [9] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069–1109, July 2011.
 [10] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. Differentially private spatial decompositions. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, pages 20–31, Washington, DC, USA, 2012. IEEE Computer Society.
 [11] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. Differentially private spatial decompositions. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, pages 20–31, Washington, DC, USA, 2012. IEEE Computer Society.
 [12] https://github.com/albertofranzin/datathesis.
 [13] https://www.ccd.pitt.edu/wiki/index.php/data_repository.
 [14] Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of structured discrete distributions. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2566–2574. Curran Associates, Inc., 2015.
 [15] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin I. P. Rubinstein. Robust and private bayesian inference. In Peter Auer, Alexander Clark, Thomas Zeugmann, and Sandra Zilles, editors, Algorithmic Learning Theory, pages 291–305, Cham, 2014. Springer International Publishing.
 [16] Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li. Differentially private data cubes: Optimizing noise sources and consistency. In Test, pages 217–228, 01 2011.
 [17] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, August 2014.
 [18] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS ’10, pages 51–60, Washington, DC, USA, 2010. IEEE Computer Society.
 [19] Gintare Karolina Dziugaite and Daniel M Roy. Data-dependent PAC-Bayes priors via differential privacy. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8430–8441. Curran Associates, Inc., 2018.
 [20] https://en.wikipedia.org/wiki/propagation_of_uncertainty.
 [21] James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 192–201, Arlington, Virginia, United States, 2016. AUAI Press.
 [22] Joseph Geumlek, Shuang Song, and Kamalika Chaudhuri. Rényi differential privacy mechanisms for posterior sampling, 2017.
 [23] Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC ’11, pages 803–812, New York, NY, USA, 2011. ACM.
 [24] Anupam Gupta, Aaron Roth, and Jonathan Ullman. Iterative constructions and private data release. In Proceedings of the 9th International Conference on Theory of Cryptography, TCC’12, pages 339–356, Berlin, Heidelberg, 2012. Springer-Verlag.
 [25] Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS’12, pages 2339–2347, USA, 2012. Curran Associates Inc.
 [26] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(12):1021–1032, Sep 2010.
 [27] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow., 3(12):1021–1032, September 2010.
 [28] Mikko Heikkilä, Eemil Lagerspetz, Samuel Kaski, Kana Shimizu, Sasu Tarkoma, and Antti Honkela. Differentially private bayesian learning on distributed data. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3226–3235. Curran Associates, Inc., 2017.
 [29] Justin Hsu, Marco Gaboardi, Andreas Haeberlen, Sanjeev Khanna, Arjun Narayan, Benjamin C. Pierce, and Aaron Roth. Differential privacy: An economic method for choosing epsilon. 2014 IEEE 27th Computer Security Foundations Symposium, Jul 2014.
 [30] Joonas Jälkö, Antti Honkela, and Onur Dikmen. Differentially private variational inference for nonconjugate models. CoRR, abs/1610.08749, 2016.
 [31] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, June 2011.
 [32] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.
 [33] Ios Kotsogiannis, Ashwin Machanavajjhala, Michael Hay, and Gerome Miklau. Pythia: Data dependent differentially private algorithm selection. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1323–1337, New York, NY, USA, 2017. ACM.
 [34] Ios Kotsogiannis, Ashwin Machanavajjhala, Michael Hay, and Gerome Miklau. Pythia: Data dependent differentially private algorithm selection. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1323–1337, New York, NY, USA, 2017. ACM.
 [35] K. B. Laskey. Sensitivity analysis for probability assessments in bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, 25(6):901–909, June 1995.
 [36] Jing Lei. Differentially private M-estimators. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 361–369. Curran Associates, Inc., 2011.
 [37] Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. A data- and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment, 7(5):341–352, Jan 2014.
 [38] Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, and Andrew McGregor. Optimizing linear counting queries under differential privacy. In Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’10, pages 123–134, New York, NY, USA, 2010. ACM.
 [39] Ninghui Li, Wahbeh H. Qardaji, and Dong Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In 7th ACM Symposium on Information, Computer and Communications Security, ASIACCS ’12, Seoul, Korea, May 24, 2012, pages 32–33, 2012.
 [40] Mijung Park, James R. Foulds, Kamalika Chaudhuri, and Max Welling. Variational bayes in private settings (vips). CoRR, abs/1611.00340, 2016.
 [41] Mijung Park, Jimmy Foulds, Kamalika Chaudhuri, and Max Welling. Dpem: Differentially private expectation maximization, 2016.
 [42] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
 [43] Judea Pearl. Graphical Models for Probabilistic and Causal Reasoning, pages 367–389. Springer Netherlands, Dordrecht, 1998.
 [44] Wahbeh Qardaji, Weining Yang, and Ninghui Li. Priview: Practical differentially private release of marginal contingency tables. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 1435–1446, New York, NY, USA, 2014. ACM.
 [45] Aaron Schein, Zhiwei Steven Wu, Mingyuan Zhou, and Hanna M. Wallach. Locally private bayesian inference for count models. CoRR, abs/1803.08471, 2018.
 [46] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18, May 2017.
 [47] https://adamdsmith.wordpress.com/2009/09/02/samplesecrecy/.
 [48] Justin Thaler, Jonathan Ullman, and Salil Vadhan. Faster algorithms for privately releasing marginals. Lecture Notes in Computer Science, pages 810–821, 2012.
 [49] Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2722–2731. Curran Associates, Inc., 2017.
 [50] Yining Wang, Yu-Xiang Wang, and Aarti Singh. Differentially private subspace clustering. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 1000–1008, Cambridge, MA, USA, 2015. MIT Press.
 [51] Yu-Xiang Wang, Stephen E. Fienberg, and Alexander J. Smola. Privacy for free: Posterior sampling and stochastic gradient monte carlo. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2493–2502. JMLR.org, 2015.
 [52] Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2451–2459. Curran Associates, Inc., 2010.
 [53] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. Proceedings of the 2017 ACM International Conference on Management of Data - SIGMOD ’17, 2017.
 [54] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. Differential privacy via wavelet transforms. IEEE Trans. on Knowl. and Data Eng., 23(8):1200–1214, August 2011.
 [55] Y. Xiao, J. Gardner, and L. Xiong. Dpcube: Releasing differentially private data cubes for health information. In 2012 IEEE 28th International Conference on Data Engineering, pages 1305–1308, April 2012.
 [56] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu. Differentially private histogram publication. In 2012 IEEE 28th International Conference on Data Engineering, pages 32–43, April 2012.
 [57] Grigory Yaroslavtsev, Cecilia M. Procopiuc, Graham Cormode, and Divesh Srivastava. Accurate and efficient private release of datacubes and contingency tables. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013), ICDE ’13, pages 745–756, Washington, DC, USA, 2013. IEEE Computer Society.
 [58] Ganzhao Yuan, Zhenjie Zhang, Marianne Winslett, Xiaokui Xiao, Yin Yang, and Zhifeng Hao. Low-rank mechanism: Optimizing batch queries under differential privacy. Proceedings of the VLDB Endowment, 5(11):1352–1363, Jul 2012.
 [59] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization, 2016.
 [60] G. Zhang and S. Li. Research on differentially private bayesian classification algorithm for data streams. In 2019 IEEE 4th International Conference on Big Data Analytics (ICBDA), pages 14–20, March 2019.
 [61] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. Privbayes: Private data release via bayesian networks. ACM Trans. Database Syst., 42(4):25:1–25:41, October 2017.
 [62] Xiaojian Zhang, Rui Chen, Jianliang Xu, Xiaofeng Meng, and Yingtao Xie. Towards accurate histogram publication under differential privacy. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 587–595, 2014.
 [63] Zuhe Zhang, Benjamin I. P. Rubinstein, and Christos Dimitrakakis. On the differential privacy of bayesian inference. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2365–2371. AAAI Press, 2016.
8 Appendix
8.1 Background Cntd.
8.1.1 Directed Graphical Models Cntd.
Definition 8.1 (Markov Blanket).
The Markov blanket for a node in a graphical model, denoted by , is the set of all nodes whose knowledge is sufficient to predict and hence shields from rest of the network [42]. In a DGM, the Markov blanket of a node consists of its child nodes, parent nodes and the parents of its child nodes.
For example, in Figure 2, .
Learning Cntd.
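As mentioned in Section 2, for a fully observed DGM, the parameters (CPDs) are computed via maximum likelihood estimation (MLE). Let the data set be $\mathcal{D}$ with $m$ i.i.d. records over the attribute set $\langle X_1, \ldots, X_n \rangle$. The likelihood is then given by
(12) $L(\Theta; \mathcal{D}) = \prod_{j=1}^{m} \prod_{i=1}^{n} \Theta[x_i^{(j)} \mid x_{pa_i}^{(j)}]$
where $\Theta[x_i \mid x_{pa_i}]$ represents the probability that $X_i = x_i$ given $X_{pa_i} = x_{pa_i}$.
Taking logs and rearranging, this reduces to
(13) $\log L(\Theta; \mathcal{D}) = \sum_{i=1}^{n} \sum_{x_i, x_{pa_i}} C[x_i, x_{pa_i}] \log \Theta[x_i \mid x_{pa_i}]$
where $C[x_i, x_{pa_i}]$ denotes the number of records in data set $\mathcal{D}$ such that $X_i = x_i$ and $X_{pa_i} = x_{pa_i}$. Thus, maximization of the (log) likelihood function decomposes into separate maximizations for the local conditional distributions, which results in the closed-form solution
(14) $\Theta[x_i \mid x_{pa_i}] = \frac{C[x_i, x_{pa_i}]}{C[x_{pa_i}]}$, where $C[x_{pa_i}] = \sum_{x_i} C[x_i, x_{pa_i}]$.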
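The closed-form solution in Eq. (14) amounts to counting: each CPD entry is the relative frequency of a child value within one parent configuration. The following is a minimal sketch of that computation; the function name `mle_cpd`, the record layout (one dict per record), and the toy attributes `smoker` and `cough` are illustrative assumptions, not part of the paper.

```python
from collections import Counter

def mle_cpd(records, child, parents):
    """Estimate P(child | parents) by relative frequencies (hypothetical helper)."""
    joint_counts = Counter()   # counts of (parent configuration, child value)
    parent_counts = Counter()  # counts of each parent configuration
    for record in records:
        pa = tuple(record[p] for p in parents)
        joint_counts[(pa, record[child])] += 1
        parent_counts[pa] += 1
    # Each entry is the joint count divided by the count of the parent configuration.
    return {(pa, x): c / parent_counts[pa] for (pa, x), c in joint_counts.items()}

# Toy data set: four fully observed records over two binary attributes.
records = [
    {"smoker": 1, "cough": 1},
    {"smoker": 1, "cough": 1},
    {"smoker": 1, "cough": 0},
    {"smoker": 0, "cough": 0},
]
cpd = mle_cpd(records, child="cough", parents=["smoker"])
```

Here `cpd[((1,), 1)]` is the estimate of the probability of `cough = 1` given `smoker = 1`, computed from two matching records out of three.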
Inference Cntd.
There are three types of inference queries in general:
(1) Marginal inference: This is used to answer queries of the type "what is the probability of a given variable if all others are marginalized". An example marginal inference query is .
(2) Conditional Inference: This type of query answers the probability distribution of some variable conditioned on some evidence . An example conditional inference query is .
(3) Maximum a posteriori (MAP) inference: This type of query asks for the most likely assignment of variables. An example of MAP query is
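The three query types can be illustrated on a tiny joint distribution. The sketch below uses a made-up joint table over two binary variables, `disease` and `symptom`; the numbers and variable names are assumptions chosen only for illustration, and the MAP query shown is the variant that fixes the evidence `symptom = 1`.

```python
# Joint distribution P(disease, symptom); keys are (disease, symptom).
# The probabilities are invented for illustration and sum to 1.
joint = {
    (0, 0): 0.56, (0, 1): 0.14,
    (1, 0): 0.06, (1, 1): 0.24,
}

# (1) Marginal inference: P(symptom = 1), with disease summed out.
p_symptom = sum(p for (d, s), p in joint.items() if s == 1)

# (2) Conditional inference: P(disease = 1 | symptom = 1).
p_disease_given_symptom = joint[(1, 1)] / p_symptom

# (3) MAP inference: most likely value of disease given symptom = 1.
map_disease = max((0, 1), key=lambda d: joint[(d, 1)])
```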
Variable Elimination Algorithm (VE): The complete VE algorithm is given by Algorithm 3. The basic idea of the variable elimination algorithm is that we "eliminate" one variable at a time following a predefined order over the nodes of the graph. Let $\Phi$ denote a set of probability factors, initialized as the set of all CPDs of the DGM, and let $Z$ denote the variable to be eliminated. In the elimination step, all the probability factors involving $Z$ are first removed from $\Phi$ and multiplied together to generate a new product factor. Next, $Z$ is summed out from this combined factor, generating a new factor that is entered into $\Phi$. Thus the VE algorithm essentially involves repeated computation of a sum-product task of the form
(15) $\sum_{Z} \prod_{\phi \in \Phi} \phi$
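A single elimination step can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 3: a factor is represented, by assumption, as a pair `(variables, table)` where `table` maps value tuples to numbers, and the helper names `multiply`, `sum_out`, and `eliminate` are hypothetical.

```python
from itertools import product

def multiply(f, g, domains):
    """Pointwise product of two factors over the union of their variables."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = tuple(dict.fromkeys(f_vars + g_vars))  # ordered union
    out_tab = {}
    for values in product(*(domains[v] for v in out_vars)):
        assignment = dict(zip(out_vars, values))
        out_tab[values] = (f_tab[tuple(assignment[v] for v in f_vars)]
                           * g_tab[tuple(assignment[v] for v in g_vars)])
    return out_vars, out_tab

def sum_out(f, var):
    """Marginalize `var` out of factor `f`."""
    f_vars, f_tab = f
    keep_idx = [i for i, v in enumerate(f_vars) if v != var]
    out_tab = {}
    for values, p in f_tab.items():
        key = tuple(values[i] for i in keep_idx)
        out_tab[key] = out_tab.get(key, 0.0) + p
    return tuple(f_vars[i] for i in keep_idx), out_tab

def eliminate(factors, var, domains):
    """One elimination step: multiply the factors mentioning `var`, then sum it out."""
    involved = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    if not involved:
        return rest
    combined = involved[0]
    for f in involved[1:]:
        combined = multiply(combined, f, domains)
    return rest + [sum_out(combined, var)]

# Toy chain A -> B with binary attributes (numbers invented for illustration).
domains = {"A": (0, 1), "B": (0, 1)}
p_a = (("A",), {(0,): 0.7, (1,): 0.3})
p_b_given_a = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
remaining = eliminate([p_a, p_b_given_a], "A", domains)
```

After eliminating `A`, `remaining` holds a single factor over `B`, which in this toy chain is the marginal distribution of `B`.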
The complexity of the VE algorithm is defined by the size of the largest factor. Here we state two lemmas regarding the intermediate factors which will be used in Section 8.3.
Lemma 8.1.
Lemma 8.2.
The size of the largest intermediary factor generated by running the VE algorithm on a DGM is at least equal to the treewidth of the graph [32].
Corollary.
The complexity of the VE algorithm with the optimal order of elimination depends on the treewidth of the graph.
8.1.2 Differential Privacy Cntd.
When applied multiple times, the differential privacy guarantee degrades gracefully as follows.
Theorem 8.3 (Sequential Composition).
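If $\mathcal{A}_1$ and $\mathcal{A}_2$ are $\epsilon_1$-DP and $\epsilon_2$-DP algorithms that use independent randomness, then releasing the outputs $(\mathcal{A}_1(\mathcal{D}), \mathcal{A}_2(\mathcal{D}))$ on database $\mathcal{D}$ satisfies $(\epsilon_1 + \epsilon_2)$-DP.
Another important result for differential privacy is that any post-processing computation performed on the noisy output of a differentially private algorithm does not cause any loss in privacy.
Theorem 8.4 (Post-Processing).
Let $\mathcal{A}$ be a randomized algorithm that is $\epsilon$-DP. Let $g$ be an arbitrary randomized mapping. Then $g \circ \mathcal{A}$ is $\epsilon$-DP.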
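Laplace Mechanism: In the Laplace mechanism, in order to publish $f(\mathcal{D})$, where $f$ maps data sets to $\mathbb{R}$, an $\epsilon$-differentially private mechanism publishes $f(\mathcal{D}) + Lap\left(\frac{\Delta f}{\epsilon}\right)$, where $\Delta f = \max_{\mathcal{D}, \mathcal{D}'} \|f(\mathcal{D}) - f(\mathcal{D}')\|_1$, taken over neighboring data sets $\mathcal{D}$ and $\mathcal{D}'$, is known as the sensitivity of the query. The pdf of $Lap(b)$ is given by $p(x) = \frac{1}{2b} e^{-|x|/b}$. The sensitivity of the function $f$ captures the magnitude by which a single individual’s data can change the function in the worst case. Intuitively, it therefore captures the uncertainty in the response that we must introduce in order to hide the participation of a single individual. For counting queries the sensitivity is 1.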
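As a minimal sketch of the Laplace mechanism for a counting query, the code below adds noise with scale equal to the sensitivity divided by the privacy parameter. The helper names `laplace_noise` and `laplace_mechanism` and the toy `ages` data are assumptions for illustration; the noise is sampled as the difference of two independent exponential variables, which is one standard way to draw from a zero-mean Laplace distribution.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw from Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Release `true_answer` with Laplace noise of scale sensitivity / epsilon."""
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)                   # fixed seed only to make the sketch repeatable
ages = [34, 61, 47, 52, 29]              # toy private data
true_count = sum(a > 40 for a in ages)   # counting query, sensitivity 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

By sequential composition (Theorem 8.3), answering two such queries with privacy parameters 0.5 each consumes a total budget of 1.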
8.2 Data-Dependent Differentially Private Parameter Learning for DGMs Cntd.
8.2.1 Consistency between noisy marginal tables
The objective of this step is to input the set of noisy marginal tables and compute perturbed versions of these tables that are mutually consistent.
Definition 8.2.
Two noisy marginal tables and are defined to be consistent (denoted by ) if and only if the marginal table over attributes in reconstructed from is exactly the same as the one reconstructed from , that is,
(16) 
where is the set of attributes on which marginal table is defined.
Mutual Consistency on a Set of Attributes:
Assume a set of tables and let . Mutual consistency, i.e., is achieved as follows:
(1) First compute the best approximation for the marginal table for the attribute set as follows
(17) 
(2) Update all s to be consistent with . Any counting query is now answered as
(18) 
where is the query restricted to attributes in and is the response of on .
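The two steps above can be sketched under explicit assumptions: the best approximation on the shared attributes is taken to be the plain average of the tables' projections, and each table is corrected by spreading the difference evenly over the cells that project to the same shared cell, in the style of PriView [44]. Whether Eqs. (17) and (18) take exactly this form is an assumption; the names `project` and `make_consistent` and the table layout are hypothetical.

```python
from collections import defaultdict

def project(attrs, table, onto):
    """Marginalize `table` (defined over `attrs`) onto the attributes in `onto`."""
    idx = [attrs.index(a) for a in onto]
    out = defaultdict(float)
    for cell, count in table.items():
        out[tuple(cell[i] for i in idx)] += count
    return dict(out)

def make_consistent(tables, shared):
    """Make all (attrs, table) pairs agree on the marginal over `shared`.

    Assumes every table lists all of its cells explicitly.
    """
    projections = [project(attrs, t, shared) for attrs, t in tables]
    # Assumed estimator: average of the individual projections.
    target = {c: sum(p[c] for p in projections) / len(projections)
              for c in projections[0]}
    fixed = []
    for (attrs, t), proj in zip(tables, projections):
        idx = [attrs.index(a) for a in shared]
        fanout = defaultdict(int)  # number of cells mapping to each shared cell
        for cell in t:
            fanout[tuple(cell[i] for i in idx)] += 1
        new_t = {}
        for cell, count in t.items():
            key = tuple(cell[i] for i in idx)
            # Spread the correction evenly over the cells sharing this key.
            new_t[cell] = count + (target[key] - proj[key]) / fanout[key]
        fixed.append((attrs, new_t))
    return fixed, target

# Two noisy tables over (A, B) and (A, C) that disagree on the marginal of A.
tables = [
    (("A", "B"), {(0, 0): 10, (0, 1): 12, (1, 0): 8, (1, 1): 9}),
    (("A", "C"), {(0, 0): 11, (0, 1): 9, (1, 0): 10, (1, 1): 9}),
]
shared = ("A",)
fixed, target = make_consistent(tables, shared)
```

After the update, projecting either corrected table onto `A` gives the same averaged marginal, so the two tables are consistent on the shared attribute.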
Overall Consistency:
(1) Take all sets of attributes that are the result of the intersection of some subset of the tables' attribute sets; these sets form a partial order under the subset relation.
(2) Obtain a topological sort of these sets, starting from the empty set.
(3) For each set $A$, find all tables that include $A$ and ensure that these tables are consistent on $A$.
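The ordering in the steps above can be sketched as follows. The sketch enumerates intersections of two or more attribute sets (a single table is trivially consistent with itself) and sorts them by size, which is a valid topological sort of the subset partial order because a proper subset always has fewer elements. The function names are hypothetical, and the enumeration is exponential in the number of tables, so this is only meant for small illustrations.

```python
from itertools import combinations

def intersection_closure(attr_sets):
    """All attribute sets obtained by intersecting two or more of the given sets."""
    sets = [frozenset(s) for s in attr_sets]
    closure = set()
    for r in range(2, len(sets) + 1):
        for combo in combinations(sets, r):
            closure.add(frozenset.intersection(*combo))
    return closure

def consistency_order(attr_sets):
    """Order the intersection sets so that every subset precedes its supersets."""
    closure = intersection_closure(attr_sets)
    return sorted(closure, key=lambda s: (len(s), sorted(s)))

def tables_including(attr_sets, shared):
    """Attribute sets of the tables that must be made consistent on `shared`."""
    return [set(s) for s in attr_sets if shared <= frozenset(s)]

attr_sets = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}]
order = consistency_order(attr_sets)
plan = [(shared, tables_including(attr_sets, shared)) for shared in order]
```

In this example the tables are first made consistent on `{C}` (all three tables) and then on `{B, C}` (the first two tables). If some tables were disjoint, the empty set would appear in the closure and sort first.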
8.3 Error Bound Analysis Cntd.
In this section we present our results on the lower and upper bounds on the error in inference queries.
Preliminaries and Notations:
For the proofs, we use the following notations.
Let be the attribute that is being eliminated and let where denotes the set of attributes in .
For some , from the variable elimination algorithm (Section 8.1.3) for a sum-product term (Eq. (15)) we have
(19) 
Let us assume that factor denotes that and . Recall that after computing a sum-product task (given by Eq. (19)), for the variable elimination algorithm (Appendix Algorithm 3), we will be left with a factor term over the attribute set . For example, if the elimination order for the variable elimination algorithm on our example DGM (Figure 2) is given by and the attributes are binary valued, then the first sum-product task will be of the following form