# Data-Dependent Differentially Private Parameter Learning for Directed Graphical Models

###### Abstract

Directed graphical models (DGMs) are a class of probabilistic models that are widely used for predictive analysis in sensitive domains, such as medical diagnostics. In this paper we present an algorithm for differentially private learning of the parameters of a DGM with a publicly known graph structure over fully observed data. Our solution optimizes for the utility of inference queries over the DGM and adds noise that is customized to the properties of the private input dataset and the graph structure of the DGM. To the best of our knowledge, this is the first explicit data-dependent privacy budget allocation algorithm for DGMs. We compare our algorithm with a standard data-independent approach over a diverse suite of DGM benchmarks and demonstrate that our solution requires a privacy budget that is 3\times smaller to obtain the same or higher utility.

## 1 Introduction

Directed graphical models (DGMs) are widely used in causal reasoning and predictive analytics where prediction interpretability is desirable [32]. A typical use case of these models is in answering “what-if” queries over domains that work with sensitive information. For example, DGMs are used in medical diagnosis for answering questions like what is the most probable disease given a set of symptoms [43]. In learning such models, it is common that the model’s graph structure is publicly known. For example, in the case of a medical data set the dependencies between several physiological symptoms and diseases are well established and standardized. However, the parameters of the model have to be learned from observations that correspond to the random variables of the model. These observations may contain sensitive information, as in the case of medical applications; thus, learning and publicly releasing the parameters of the probabilistic model may lead to privacy violations [46, 59]. This establishes the need for privacy-preserving learning mechanisms for DGMs.

In this paper, we focus on the problem of privacy-preserving learning of the parameters of a DGM. For our privacy measure, we use differential privacy (DP) [17] – the de-facto standard for privacy. We consider the case when the structure of the target DGM is publicly known and the parameters of the model are learned from fully observed data. In this case, all parameters can be estimated via counting queries over the input observations (also referred to as data set in the remainder of the paper). One way to ensure privacy is to add suitable noise to the observations using the standard Laplace mechanism [17]. Unfortunately, this method is data-independent, i.e., the noise added to the base observations is agnostic to properties of the input data set and the structure of the DGM, and degrades utility. To address this issue, we turn to data-dependent methods, i.e., methods in which the added noise is customized to the properties of the input data sets [37, 62, 10, 2, 56, 55, 33].

We propose a data-dependent, \epsilon-DP algorithm for learning the parameters of a DGM over fully observed data. Our goal is to minimize errors in arbitrary inference queries that are subsequently answered over the learned DGM. We outline our main contributions as follows:

(1) Explicit data-dependent privacy-budget allocation: Our algorithm computes the parameters of the conditional probability distribution of each random variable in the DGM via separate measurements from the input data set; this gives rise to an opportunity to optimize the privacy budget allocation across the different variables with the objective of reducing the error in inference queries. We formulate this objective in a data-dependent manner; the objective is informed by both the private input data set and the public graph structure of the DGM. To the best of our knowledge this is the first paper to propose explicit data-dependent privacy-budget allocation for DGMs. We evaluate our algorithm on four benchmark DGMs. We demonstrate that our scheme only requires a privacy budget of \epsilon=1.0 to yield the same utility that a standard data-independent method achieves with \epsilon=3.0.

(2) New theoretical results: To preserve privacy we add noise to the parameters of the DGM. To understand how this noise propagates to inference queries, we provide two new theoretical results on the upper and lower bounds of the error of inference queries. The upper bound has an exponential dependency on the treewidth of the input DGM while the lower bound depends on its maximum degree. We also provide a formulation to compute the sensitivity [35] of the parameters associated with a node of a DGM targeting the probability distribution of its child nodes only.

## 2 Background

We review basic background material for the problems and techniques discussed in this paper.

Directed Graphical Models: A directed graphical model (DGM) or a Bayesian network is a probabilistic model that is represented as a directed acyclic graph \mathcal{G}. The nodes of the graph represent random variables and the edges encode conditional dependencies between the variables. The graphical structure of the DGM represents a factorization of the joint probability distribution of these random variables. Specifically, given a DGM with graph \mathcal{G}, let X_{1},\ldots,X_{n} be the random variables corresponding to the nodes of \mathcal{G} and X_{pa_{i}} denote the set of parents in \mathcal{G} for the node corresponding to variable X_{i}. The joint probability distribution factorizes as P[X_{1},\ldots,X_{n}]=\prod_{i=1}^{n}P[X_{i}|X_{pa_{i}}] where each factor P[X_{i}|X_{pa_{i}}] corresponds to a conditional probability distribution (CPD). For DGMs with discrete random variables, each CPD can be represented as a table of parameters \Theta_{x_{i}|x_{pa_{i}}} where each parameter corresponds to a conditional probability and x_{i} and x_{pa_{i}} denote variable assignments X_{i}=x_{i} and X_{pa_{i}}=x_{pa_{i}}.

A key task in DGMs is parameter learning. Given a DGM with known graph structure \mathcal{G}, the goal of parameter learning is to estimate each \Theta_{x_{i}|x_{pa_{i}}}, a task solved via maximum likelihood estimation (MLE). In the presence of fully observed data \mathcal{D} over the random variables of \mathcal{G} (the attributes of the data set become the nodes of the DGM's graph; for the remainder of the paper we use the two interchangeably depending on the context), the maximum likelihood estimates of the CPD parameters take the closed form [32]:

\small\Theta_{x_{i}|x_{pa_{i}}}=C[x_{i},x_{pa_{i}}]/C[x_{pa_{i}}] | (1) |

where C[x_{i}] denotes the number of records in \mathcal{D} such that X_{i}=x_{i}.
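As a minimal sketch of Eq. (1), the counting estimate can be written in a few lines of Python. The function name `mle_cpd`, the record format (one dictionary per record) and the toy data over a two-node graph A -> B are illustrative assumptions, not part of the paper.

```python
from collections import Counter

def mle_cpd(records, child, parents):
    """Estimate Theta[x_i | x_pa_i] = C[x_i, x_pa_i] / C[x_pa_i] by counting."""
    joint = Counter()    # C[x_i, x_pa_i]
    parent = Counter()   # C[x_pa_i]
    for r in records:
        pa = tuple(r[p] for p in parents)
        joint[(r[child], pa)] += 1
        parent[pa] += 1
    # Keys are (child value, tuple of parent values).
    return {key: n / parent[key[1]] for key, n in joint.items()}

# Hypothetical toy data set over two binary attributes with graph A -> B.
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]
records = [{"A": a, "B": b} for a, b in pairs]

theta_a = mle_cpd(records, "A", [])      # root node: empty parent set
theta_b = mle_cpd(records, "B", ["A"])   # CPD of B given A
```

Only value combinations that occur in the data receive an entry here; unobserved cells would need explicit handling.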

After learning, the fully specified DGM can be used to answer inference queries, i.e., queries that seek to compute the marginal or conditional probabilities of certain events (variables) of interest. Inference queries can also involve evidence, in which case the assignment of a subset of nodes is fixed. Inference queries are of three types: (1) marginal inference, (2) conditional inference, and (3) maximum a posteriori (MAP) inference. We refer the reader to Appendix A.1 for more details.

For DGMs, inference queries can be answered exactly by the variable elimination (VE) algorithm [32], which is described in detail in Appendix 8.1.1. The basic idea is that we "eliminate" one variable at a time following a predefined order \prec over the graph nodes. Let \Phi denote a set of probability factors \phi (initialized with all the CPDs of the DGM) and Z denote the variable to be eliminated. First, all probability factors involving Z are removed from \Phi and multiplied together to generate a new product factor. Next Z is summed out from this combined factor generating a new factor \boldsymbol{\phi} that is entered into \Phi. VE corresponds to repeated sum-product computations: \boldsymbol{\phi}=\sum_{Z}\prod_{\phi\in\Phi}\phi.
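A single elimination step can be sketched as follows. This is an illustrative implementation under simplifying assumptions: a factor is a pair (scope, table) with the table stored as a dictionary, and the names `eliminate`, `factors` and `domains` are hypothetical rather than taken from the paper.

```python
from itertools import product

def eliminate(factors, var, domains):
    """One VE step: multiply all factors mentioning `var`, then sum `var` out."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    # Scope of the product factor, in first-seen order.
    scope = []
    for vars_, _ in touching:
        for v in vars_:
            if v not in scope:
                scope.append(v)
    new_scope = tuple(v for v in scope if v != var)
    table = {}
    for assignment in product(*(domains[v] for v in scope)):
        env = dict(zip(scope, assignment))
        value = 1.0
        for vars_, tab in touching:
            value *= tab[tuple(env[v] for v in vars_)]
        key = tuple(env[v] for v in new_scope)
        table[key] = table.get(key, 0.0) + value   # summing out `var`
    return rest + [(new_scope, table)]

# Hypothetical two-node DGM A -> B with binary attributes.
domains = {"A": (0, 1), "B": (0, 1)}
factors = [
    (("A",), {(0,): 0.6, (1,): 0.4}),                      # P[A]
    (("B", "A"), {(0, 0): 0.9, (1, 0): 0.1,
                  (0, 1): 0.2, (1, 1): 0.8}),              # P[B | A]
]
p_b = eliminate(factors, "A", domains)[-1][1]              # marginal P[B]
```

Repeating this step for every non-query variable, in the order \prec, leaves a single factor over the query variables.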

Differential Privacy: We define differential privacy as follows:

###### Definition 2.1.

An algorithm \mathcal{A} satisfies \epsilon-differential privacy (\epsilon-DP), where \epsilon>0 is a privacy parameter, iff for any two datasets D and D^{\prime} that differ in a single record, we have

\small\forall t\in Range(\mathcal{A}),P\big{[}\mathcal{A}(D)=t\big{]}\leq e^{% \epsilon}P\big{[}\mathcal{A}(D^{\prime})=t\big{]} | (2) |

A result by [47, 31, 39] shows that the privacy guarantee of a differentially private algorithm can be amplified by a preceding sampling step. Let \mathcal{A} be an \epsilon-DP algorithm and \mathcal{D} be an input data set. Let \mathcal{A}^{\prime} be an algorithm that runs \mathcal{A} on a random subset of \mathcal{D} obtained by sampling its records with probability \beta.

###### Lemma 2.1.

Algorithm \mathcal{A}^{\prime} will satisfy \epsilon^{\prime}-DP where \epsilon^{\prime}=\ln(1+\beta(e^{\epsilon}-1)).
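The amplification formula of Lemma 2.1 is easy to evaluate numerically. The sketch below only restates the formula; the function name `amplified_epsilon` and the sample values of \beta are illustrative choices.

```python
import math

def amplified_epsilon(eps, beta):
    """Overall budget eps' = ln(1 + beta * (e^eps - 1)) from Lemma 2.1."""
    # log1p/expm1 keep the computation accurate for small eps.
    return math.log1p(beta * math.expm1(eps))

# Effective budget of a 1.0-DP mechanism run on samples of decreasing rate.
table = {beta: amplified_epsilon(1.0, beta) for beta in (1.0, 0.5, 0.1)}
```

With \beta=1 (no sampling) the formula returns \epsilon itself, and the effective budget shrinks as \beta decreases.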

## 3 Data-Dependent Differentially Private Parameter Learning for DGMs

We start with an overview of differentially private learning for DGMs over fully observed data and of our data-dependent \epsilon-DP algorithm. We then present our algorithm in detail.

### 3.1 Problem and Approach Overview

Let \mathcal{D} be a sensitive data set of size m with attributes \mathcal{X}=\langle X_{1},X_{2},\cdots,X_{n}\rangle and let \mathcal{N} be the DGM of interest. The graph structure \mathcal{G} of \mathcal{N} defined over the attribute set \mathcal{X} is publicly known. Our goal is to learn the parameters \Theta, i.e., the CPDs of \mathcal{N}, in a data-dependent differentially private manner from \mathcal{D}. In addition to guaranteeing \epsilon-DP, we seek to optimize the way noise is introduced in the learned parameters such that the error in inference queries over the \epsilon-DP DGM is minimized. Errors are defined by comparing the inference results over the \epsilon-DP DGM to the inference results obtained over the MLE DGM with no noise injection.

We build upon two observations to solve the above problem:

(1) Parameters \Theta[X_{i}|X_{{pa}_{i}}] of DGM \mathcal{N} can be estimated separately via counting queries over the empirical marginal table (joint distribution P_{\mathcal{N}}[X_{i},X_{{pa}_{i}}]) of the attribute set X_{i}\cup X_{pa_{i}}.

(2) The factorization over \mathcal{N} decomposes the overall \epsilon-DP learning problem into a series of separate \epsilon-DP learning sub-problems (one for each CPD). As a result the total privacy budget has to be divided among these sub-problems. However, owing to the structure of the graph and the data set, certain nodes will have more impact on an inference query than others. Hence, allocating more budget (and thereby getting better accuracy) to these nodes will result in reduced overall error for the inference queries. Thus, careful privacy budget allocation across the marginal tables can lead to better utility guarantees as compared to a naive budgeting scheme that assigns equal budget to all marginal tables.

The above observations are key to our proposed algorithm, which computes the parameters of \mathcal{N} from their respective marginal tables with data-dependent noise injections to guarantee \epsilon-DP. Our method is outlined in Algorithm 2 and proceeds in two stages. In the first stage, we obtain preliminary noisy measurements of the parameters of \mathcal{N}, which are used along with graph specific properties (i.e., the height and out-degree of each node) to formulate a data-dependent optimization objective for privacy budget allocation. The solution of this objective is then used in the second stage to compute the final parameters. In summary, if \epsilon^{B} is the total privacy budget available, we spend \epsilon^{I} of it to obtain preliminary parameter measurements in Stage I and the remaining \epsilon^{B}-\epsilon^{I} is used for the final parameter computation in Stage II, after optimal allocation across the marginal tables.

Next we describe our algorithm in detail and highlight how we address two core technical challenges in our approach: (1) how to reduce the privacy budget cost for the first stage \epsilon^{I} (equivalently increase \epsilon^{B}-\epsilon^{I}) (see Alg. 2, Lines 1-3), and (2) what properties of the input dataset and graph should the optimization objective be based on (see Alg. 2, Lines 5-10).

### 3.2 Algorithm Description

We now describe the two stages of our core method in turn:

Stage I – Formulation of optimization objective: First, we handle the trade-off between the two parts of the total privacy budget \epsilon^{I} and \epsilon^{B}-\epsilon^{I}. While one wants to maximize \epsilon^{B}-\epsilon^{I} to reduce the amount of noise in the final parameters, sufficient budget \epsilon^{I} is required to obtain good estimates of the statistics of the data set that are required to form the data-dependent budget allocation objective. To handle this trade-off, we use Lemma 2.1 to improve the accuracy of the optimization objective computation (Alg. 2, Lines 1-2). This allows us to assign a relatively low value to \epsilon^{I} so that our budget for the final parameter computation is increased. Recall from Lemma 2.1 that if we perform a preceding sampling step before an \epsilon-DP mechanism, then the overall mechanism is \epsilon^{\prime}-DP where \epsilon^{\prime}<\epsilon. Hence, for large values of m and sampling rate \beta, \small\sqrt{\frac{\beta(1-\beta)}{m}}+\frac{\sqrt{2}}{\epsilon}<\frac{\sqrt{2}}{\epsilon^{\prime}}, where the left-hand side is the error for the above mechanism while the right-hand side is the error of a standard \epsilon^{\prime}-DP mechanism (assuming the Laplace mechanism – Appendix A.1.2).
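To make the inequality concrete, the following sketch evaluates both sides for the setting reported in Section 5.1 (\beta=0.1, m=10,000 and \epsilon^{I}=0.1\cdot\epsilon^{B} with \epsilon^{B}=1). The function names are hypothetical, and the inner budget \epsilon is obtained by inverting Lemma 2.1.

```python
import math

def inner_epsilon(eps_prime, beta):
    """Budget eps the inner mechanism may use so that sampling at rate beta
    gives eps_prime overall (Lemma 2.1 solved for eps)."""
    return math.log1p(math.expm1(eps_prime) / beta)

def sampled_error(eps_prime, beta, m):
    """Left-hand side: sampling error plus Laplace error at the inner budget."""
    eps = inner_epsilon(eps_prime, beta)
    return math.sqrt(beta * (1 - beta) / m) + math.sqrt(2) / eps

def direct_error(eps_prime):
    """Right-hand side: Laplace error of a plain eps_prime-DP mechanism."""
    return math.sqrt(2) / eps_prime

lhs = sampled_error(eps_prime=0.1, beta=0.1, m=10_000)
rhs = direct_error(eps_prime=0.1)
```

For these values the inner mechanism can spend roughly seven times the nominal budget, so the left-hand side comes out well below the right-hand side.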

Next, we estimate the parameters \hat{\Theta} on the aforementioned sampled data set D^{\prime} via the procedure ComputeParameters (described below and outlined in Algorithm 1) using budget allocation \mathpzc{E} (Alg. 2, Lines 3-4). Note that \hat{\Theta} is only required for the optimization objective formulation and is different from the final parameters \tilde{\Theta} (Alg. 2, Line 13). Hence, for \hat{\Theta} we use a naive allocation policy that assigns an equal privacy budget to all tables when calling ComputeParameters (Alg. 2, Line 4).

Finally, lines 5-12 of Algorithm 2 correspond to the estimation of the privacy budget optimization objective \mathcal{F}_{\mathcal{D},\mathcal{G}} that depends on the data set \mathcal{D} and graph structure \mathcal{G} and is detailed in Section 3.3.

Stage II – Final parameter computation: We solve for the optimal privacy budget allocation \mathpzc{E}^{*} from \mathcal{F}_{\mathcal{D},\mathcal{G}} and compute the final parameters \tilde{\Theta} using ComputeParameters (Alg. 2, Lines 13-15).

Procedure ComputeParameters: This procedure is outlined in Algorithm 1. Its goal is, given a privacy budget allocation \mathpzc{E}, to derive the parameters of \mathcal{N} in a differentially private manner. To this end, the algorithm first materializes the table for a subset of attributes X_{i}\cup X_{pa_{i}},i\in[n] (Alg. 1, Line 2), and then injects noise drawn from Lap(\frac{1}{\mathpzc{E}[i]}) into each of its cells (Alg. 1, Line 3) to generate \tilde{T}_{i} [17]. Since each entry of T_{i} corresponds to a counting query, the sensitivity is 1. Next, it converts \tilde{T}_{i} to a marginal table \tilde{M}_{i}, i.e., joint distribution P_{\mathcal{N}}[X_{i},X_{pa_{i}}] (Alg. 1, Line 4). This is followed by ensuring that all \tilde{M}_{i}s are mutually consistent on all subsets of attributes using the method described in Appendix A.2.1. Finally, the algorithm derives \tilde{\Theta}[X_{i}|X_{pa_{i}}] (the noisy estimate of P_{\mathcal{N}}[X_{i}|X_{pa_{i}}]) from \tilde{M_{i}} (Lines 7-10). Note that deriving the marginal table \tilde{M}_{i} from \tilde{T}_{i} instead of directly computing it results in lower error (since the sensitivity of P_{\mathcal{N}}[X_{i}|X_{pa_{i}}] is 2).
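The per-node part of this procedure can be sketched as below. This is a simplified illustration, not Algorithm 1 itself: it handles one node at a time, omits the mutual-consistency step of Appendix A.2.1, and clamps negative noisy counts to a small positive value, which is an assumption made here so that normalization is well defined. The names `laplace` and `noisy_cpd` are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_cpd(records, child, parents, domains, eps, rng):
    attrs = [child] + list(parents)
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    cells = list(product(*(domains[a] for a in attrs)))
    # Noisy table: each cell is a counting query (sensitivity 1), noise Lap(1/eps).
    noisy = {c: counts[c] + laplace(1 / eps, rng) for c in cells}
    # Assumption: clamp to a small positive value, then normalize to a marginal table.
    clipped = {c: max(v, 1e-9) for c, v in noisy.items()}
    total = sum(clipped.values())
    marginal = {c: v / total for c, v in clipped.items()}
    # Derive the CPD by dividing each cell by the mass of its parent assignment.
    parent_mass = Counter()
    for c, p in marginal.items():
        parent_mass[c[1:]] += p
    return {c: p / parent_mass[c[1:]] for c, p in marginal.items()}

# Hypothetical toy data over the graph A -> B.
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]
records = [{"A": a, "B": b} for a, b in pairs]
domains = {"A": (0, 1), "B": (0, 1)}
theta_b = noisy_cpd(records, "B", ["A"], domains, eps=0.5, rng=random.Random(0))
```

Keys of the returned dictionary are tuples (child value, parent values...), and each conditional distribution sums to one by construction.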

### 3.3 Optimal Privacy Budget Allocation

Our goal is to find the optimal privacy budget allocation over the marginal tables, \tilde{M}_{i},i\in[n] for \mathcal{N} such that the error in the subsequent inference queries on \mathcal{N} is minimized.

Observation I: A more accurate estimate of the parameters of \mathcal{N} will result in better accuracy for the subsequent inference queries. Hence, we focus on reducing the total error of the parameters of \mathcal{N}. From Eq (1) and our Laplace noise injection (Alg. 1, Line 3), for a privacy budget of \epsilon, the value of a parameter of the DGM computed from the noisy marginal tables \tilde{M}_{i} is expected to be

\small\tilde{\Theta}[x_{i}|x_{pa_{i}}]=\Big{(}C[x_{i},x_{pa_{i}}]\pm\frac{\sqrt{2}}{\epsilon}\Big{)}\Big{/}\Big{(}C[x_{pa_{i}}]\pm\frac{\sqrt{2}}{\epsilon}\Big{)}\quad\Big{[}\text{follows from the Laplace noise added}\Big{]} | (3) |

From the rules of standard error propagation [20], the error in \Theta[x_{i}|x_{pa_{i}}] is

\delta_{\Theta[x_{i}|x_{pa_{i}}]}=\Theta[x_{i}|x_{pa_{i}}]\sqrt{\frac{2}{(\epsilon\cdot C[x_{pa_{i}}])^{2}}+\frac{2}{(\epsilon\cdot C[x_{i},x_{pa_{i}}])^{2}}}=\frac{\sqrt{2}\Theta[x_{i}|x_{pa_{i}}]}{\epsilon}\cdot\sqrt{\frac{1}{C[x_{pa_{i}}]^{2}}+\frac{1}{C[x_{i},x_{pa_{i}}]^{2}}} | (4) |

where C[x_{i}] denotes the number of records in \mathcal{D} with X_{i}=x_{i}. Thus the mean error for the parameters of X_{i} is \delta_{i}=\frac{1}{|dom(X_{i}\cup X_{pa_{i}})|}\sum_{x_{i},x_{pa_{i}}}\delta_{\Theta[x_{i}|x_{pa_{i}}]}=\frac{\sqrt{2}}{\epsilon|dom(X_{i}\cup X_{pa_{i}})|}\cdot\sum_{x_{i},x_{pa_{i}}}\Theta[x_{i}|x_{pa_{i}}]\sqrt{\frac{1}{C[x_{pa_{i}}]^{2}}+\frac{1}{C[x_{i},x_{pa_{i}}]^{2}}}, where dom(S) is the domain of the attribute set S\subset\mathcal{X}. Since using the true values of the counts C[x] would lead to a privacy violation, our algorithm uses the noisy estimates from \hat{T}_{i} (Alg. 2, Line 8).
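The mean error \delta_{i} can be computed directly from a table of counts, as in the sketch below. It assumes every cell of the table is present with a strictly positive count and takes the parent counts as sums over the child values; the function name and the example counts are illustrative. In the algorithm itself the counts would be the noisy ones, not the true ones.

```python
import math

def mean_parameter_error(joint_counts, eps):
    """delta_i for one node; joint_counts maps (x_i, x_pa_i) -> C[x_i, x_pa_i]."""
    parent_counts = {}
    for (_, pa), n in joint_counts.items():
        parent_counts[pa] = parent_counts.get(pa, 0) + n   # C[x_pa_i]
    total = 0.0
    for (_, pa), n in joint_counts.items():
        theta = n / parent_counts[pa]                      # Theta[x_i | x_pa_i]
        total += theta * math.sqrt(1 / parent_counts[pa] ** 2 + 1 / n ** 2)
    # |dom(X_i u X_pa_i)| is the number of cells in the table.
    return math.sqrt(2) / (eps * len(joint_counts)) * total

# Hypothetical counts for a binary node with one binary parent.
counts = {(0, (0,)): 30, (1, (0,)): 10, (0, (1,)): 10, (1, (1,)): 30}
delta_at_1 = mean_parameter_error(counts, eps=1.0)
delta_at_2 = mean_parameter_error(counts, eps=2.0)
```

The error is inversely proportional to \epsilon and shrinks as the counts grow, which is why sparsely populated tables benefit most from a larger share of the budget.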

Observation II: Depending on the data set and the graph structure, different nodes will have different impact on the final inference queries. This information can be captured by a corresponding weighting coefficient W[i],i\in[n] for each node.

Let \epsilon_{i} denote the optimal privacy budget for node X_{i}. Thus from the above observations, we formulate our optimization objective \mathcal{F}_{\mathcal{D},\mathcal{G}} as a weighted sum of the parameter error as follows

\mbox{minimize }\sum_{i=1}^{n-1}W[i]\cdot\frac{\tilde{\delta}_{i}}{\epsilon_{i}}+W[n]\cdot\frac{\tilde{\delta}_{n}}{\epsilon^{B}-\epsilon^{I}-\sum_{i=1}^{n-1}\epsilon_{i}},\quad\epsilon_{i}>0\thinspace\thinspace\forall i\in[n] | (5) |

\tilde{\delta}_{i}=\frac{1}{|dom(X_{i}\cup X_{pa_{i}})|}\sum_{x_{i},x_{pa_{i}}}\hat{\Theta}[x_{i}|x_{pa_{i}}]\sqrt{\frac{1}{\hat{T}[x_{pa_{i}}]^{2}}+\frac{1}{\hat{T}[x_{i},x_{pa_{i}}]^{2}}},\quad\hat{T}[x_{i},x_{pa_{i}}],\hat{T}[x_{pa_{i}}]\in\hat{T}_{i} | (6) |

W[i]=(h_{i}+1)\cdot(o_{i}+1)\cdot(\tilde{\Delta}^{\mathcal{N}}_{i}+1) | (7) |

where h_{i} is the height of the node X_{i}, o_{i} is the out-degree of X_{i}, \tilde{\Delta}^{\mathcal{N}}_{i} is the noisy estimate of the sensitivity [35] of the parameters of X_{i}, \frac{\tilde{\delta}_{i}}{\epsilon_{i}} gives the measure for the estimated mean error for the parameters (CPD) of X_{i}, and the denominator of the last term captures the linear constraint \sum_{i=1}^{n}\epsilon_{i}=\epsilon^{B}-\epsilon^{I}.

Computation of weighting coefficient W[i]: For a given node X_{i}, the weighting coefficient W[i] is computed from the following three node features:

(1) Height of the node h_{i}: The height of a node X_{i} is defined as the length of the longest path between X_{i} and a leaf node. From the VE algorithm, it is intuitive that when a node with a large height is eliminated early in the order, its errors can affect the computation of nodes with low height.

(2) Out-degree of the node o_{i}: A node causally affects all its child nodes. Thus a node with a high out-degree will have more impact on inference queries than, say, a leaf node.

(3) Sensitivity \Delta^{\mathcal{N}}_{i}: Sensitivity of a parameter in a DGM measures the impact of small changes in the parameter value on a target probability. Laskey [35] proposed a method of computing the sensitivity by using the partial derivative of output probabilities with respect to the parameter being varied. However, previous work has mostly focused on the case where the target probability is the joint distribution of all the variables. In this paper we present a method to compute sensitivity targeting the probability distribution of child nodes only. Let \Delta^{\mathcal{N}}_{i} denote the mean sensitivity of the parameters of X_{i} on target probabilities of all the nodes in Child(X_{i})= \{\text{set of all the child nodes of {$X_{i}$}}\}. Formally,

\Delta^{\mathcal{N}}_{i}=\frac{1}{|dom(X_{i}\cup X_{pa_{i}})|}\sum_{x_{i},x_{pa_{i}}}\frac{1}{|Child(X_{i})|}\sum_{Y\in Child(X_{i})}\frac{1}{|dom(Y)|}\sum_{y}\pdv{P_{\mathcal{N}}[Y=y]}{\Theta[x_{i}|x_{pa_{i}}]} | (8) |

Observe that a node X_{i} can affect another node Y only if Y\in\mathpzc{P}(X_{i}) (its Markov blanket – Defn. A.1). However, due to the factorization of the joint distribution of all random variables of a DGM, \forall Y\in\mathpzc{P}(X_{i}),Y\not\in Child(X_{i}),P_{\mathcal{N}}[Y] can be expressed without \Theta[x_{i}|x_{pa_{i}}]. Thus just computing the mean sensitivity of the parameters over the set of child nodes \Delta^{\mathcal{N}}_{i} turns out to be a good weighting metric for our setting. \Delta^{\mathcal{N}}_{i} for leaf nodes is thus 0. Note that \Delta^{\mathcal{N}}_{i} is distinct from the notion of sensitivity of a function used in the Laplace mechanism (Appendix A.1.2).

Computing Sensitivity \Delta^{\mathcal{N}}_{i}: Let Y\in Child(X_{i}) and let \Gamma(Y)=\{\mathbf{Y}_{1},\cdots,\mathbf{Y}_{t}\},t<n denote the set of all nodes \mathbf{Y}_{i} such that there is a directed path from \mathbf{Y}_{i} to Y, i.e., the set of ancestors of Y in \mathcal{G}. From the factorization of \mathcal{N}, the conditional probability distributions of the nodes in \Gamma(Y)\cup Y are sufficient to compute P_{\mathcal{N}}[Y] as follows

\small P_{\mathcal{N}}[\mathbf{Y}_{1},\cdots,\mathbf{Y}_{t},Y]=P_{\mathcal{N}}[Y|Y_{pa}]\cdot\prod_{\mathbf{Y}_{i}\in\Gamma(Y)}P_{\mathcal{N}}[\mathbf{Y}_{i}|\mathbf{Y}_{pa_{i}}],\quad P_{\mathcal{N}}[Y]=\sum_{\mathbf{Y}_{1}}...\sum_{\mathbf{Y}_{t}}P_{\mathcal{N}}[\mathbf{Y}_{1},\cdots,\mathbf{Y}_{t},Y] |

Therefore, using our noisy preliminary parameter estimates (Alg 2, Line 4), we compute

{\pdv{\tilde{P}_{\mathcal{N}}[Y=y]}{\Theta[x_{i}|x_{pa_{i}}]}}=\sum_{\mathbf{Y}_{i}\in\Gamma(Y)}\Big{(}\prod_{\mathbf{Y}_{i}\in\Gamma(Y)}\hat{\Theta}[y_{i}|y_{pa_{i}}]\cdot\zeta_{x_{i},x_{pa_{i}}}(\mathbf{Y}_{i}=y_{i},\mathbf{Y}_{pa_{i}}=y_{pa_{i}})\Big{)}\cdot\hat{\Theta}[Y=y|y_{pa}]\cdot\zeta_{x_{i},x_{pa_{i}}}(Y_{pa}=y_{pa}) |

\small\zeta_{x_{i},x_{pa_{i}}}(Z_{1}=z_{1},\cdots,Z_{t}=z_{t})=\left\{\begin{array}[]{ll}1&\mbox{if }\bigcup_{i=1}^{t}Z_{i}\bigcap\{X_{i}\cup X_{pa_{i}}\}=\varnothing\\ 1&\mbox{if }\forall Z_{i},Z_{i}\in\{X_{i}\cup X_{pa_{i}}\}\Rightarrow z_{i}\in\{x_{i},x_{pa_{i}}\}\\ 0&\mbox{otherwise}\end{array}\right. | (9) |

where \zeta_{x_{i},x_{pa_{i}}} ensures that only the product terms \prod_{\mathbf{Y}_{i}\in\Gamma(Y)}\hat{\Theta}[y_{i}|y_{pa_{i}}] involving parameters with attributes in \{X_{i},X_{pa_{i}},Y\} that match up with the corresponding values in \{x_{i},x_{pa_{i}},y\} are retained in the computation (as all other terms have partial derivative 0). Thus the noisy mean sensitivity estimate for the parameters of node X_{i}, \tilde{\Delta}^{\mathcal{N}}_{i}, can be computed from Eq. (8) and (9) (Alg. 2, Line 9).

For example, for the DGM given by Figure 2, for node A (assuming binary attributes for simplicity) we need to compute the sensitivity of its parameters on the target probability of C, i.e., \pdv{P[C=0]}{\Theta[A=0]}, \pdv{P[C=0]}{\Theta[A=1]}, \pdv{P[C=1]}{\Theta[A=1]} and \pdv{P[C=1]}{\Theta[A=0]}. The first of these is computed as \pdv{\tilde{P}[C=0]}{\Theta[A=0]}=\hat{\Theta}[B=0]\hat{\Theta}[C=0|A=0,B=0]+\hat{\Theta}[B=1]\hat{\Theta}[C=0|A=0,B=1]. The rest of the partial derivatives are computed in a similar manner to give us \Delta^{\mathcal{N}}_{A}=\frac{1}{4}\Big{(}\pdv{P[C=0]}{\Theta[A=0]}+\pdv{P[C=0]}{\Theta[A=1]}+\pdv{P[C=1]}{\Theta[A=1]}+\pdv{P[C=1]}{\Theta[A=0]}\Big{)}.
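The worked example can be reproduced with a short sketch. It assumes, as the formula above suggests, that A and B are root nodes and both are parents of C, with all attributes binary; the parameter values and function names below are made up for illustration.

```python
def p_c(theta_a, theta_b, theta_c, c):
    """P[C=c] = sum_{a,b} Theta[a] * Theta[b] * Theta[c | a, b]."""
    return sum(theta_a[a] * theta_b[b] * theta_c[(c, a, b)]
               for a in (0, 1) for b in (0, 1))

def dp_c_dtheta_a(theta_b, theta_c, c, a):
    """Partial derivative of P[C=c] with respect to Theta[A=a]."""
    return sum(theta_b[b] * theta_c[(c, a, b)] for b in (0, 1))

def sensitivity_a(theta_b, theta_c):
    """Mean of the four partial derivatives, i.e. Delta_A for this graph."""
    return sum(dp_c_dtheta_a(theta_b, theta_c, c, a)
               for c in (0, 1) for a in (0, 1)) / 4

# Hypothetical (noisy) parameter estimates.
theta_a = {0: 0.6, 1: 0.4}
theta_b = {0: 0.3, 1: 0.7}
p_c0 = {(0, 0): 0.9, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.5}   # P[C=0 | a, b]
theta_c = {}
for (a, b), p in p_c0.items():
    theta_c[(0, a, b)] = p
    theta_c[(1, a, b)] = 1 - p

d00 = dp_c_dtheta_a(theta_b, theta_c, c=0, a=0)   # 0.3*0.9 + 0.7*0.4
delta_a = sensitivity_a(theta_b, theta_c)
```

Because P[C=c] is linear in each \Theta[A=a], the derivative is simply the sum of the product terms that contain that parameter, with the parameter itself removed.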

Thus the weighting coefficient W[i] is defined as the product of the aforementioned three features and given by Eq. (7). The extra additive term 1 is used to handle leaf nodes so that the weighting coefficients are non-zero. Letting c_{i}=\frac{1}{W[i]\cdot\tilde{\delta}_{i}},i\in[n], \mathcal{F}_{\mathcal{D},\mathcal{G}} has a closed-form solution as follows

\small\epsilon^{*}_{i}=\frac{(\epsilon^{B}-\epsilon^{I})\prod_{j=1,j\neq i}^{n}\sqrt{c_{j}}}{\sum_{j\in[n]}\prod_{l=1,l\neq j}^{n}\sqrt{c_{l}}},i\in[n-1],\quad\epsilon^{*}_{n}=\epsilon^{B}-\epsilon^{I}-\sum_{i=1}^{n-1}\epsilon^{*}_{i} | (10) |
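Dividing the numerator and denominator of Eq. (10) by \prod_{l}\sqrt{c_{l}} shows that \epsilon^{*}_{i} is proportional to 1/\sqrt{c_{i}}=\sqrt{W[i]\cdot\tilde{\delta}_{i}}. The sketch below uses this equivalent form, which avoids the long products; the function names and the example inputs are illustrative assumptions.

```python
import math

def weights(heights, out_degrees, sensitivities):
    """W[i] = (h_i + 1) * (o_i + 1) * (Delta_i + 1), as in Eq. (7)."""
    return [(h + 1) * (o + 1) * (s + 1)
            for h, o, s in zip(heights, out_degrees, sensitivities)]

def allocate_budget(w, delta, budget):
    """Split `budget` so that eps_i is proportional to sqrt(W[i] * delta_i)."""
    roots = [math.sqrt(wi * di) for wi, di in zip(w, delta)]
    total = sum(roots)
    return [budget * r / total for r in roots]

def objective(w, delta, eps):
    """Weighted parameter error of Eq. (5) for a given allocation."""
    return sum(wi * di / ei for wi, di, ei in zip(w, delta, eps))

# Hypothetical three-node example with eps^B - eps^I = 0.9.
w = weights(heights=[2, 1, 0], out_degrees=[1, 1, 0], sensitivities=[0.5, 0.5, 0.0])
delta = [0.02, 0.05, 0.01]
eps_star = allocate_budget(w, delta, budget=0.9)
eps_equal = [0.3, 0.3, 0.3]
```

Comparing `objective(w, delta, eps_star)` with `objective(w, delta, eps_equal)` shows how much the weighted error drops relative to an equal split.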

Discussion: Note that there are two paradigms of information to be considered for a DGM - (1) graph structure \mathcal{G} (2) data set \mathcal{D}. h_{i} and o_{i} are purely graph characteristics and they summarise the graphical properties of the node X_{i}. \tilde{\Delta}^{\mathcal{N}}_{i} captures the interactions of the graph structure with the actual parameter values thereby encoding the data set dependent information. Hence we theorize that the aforementioned three features are sufficient for constructing the weighting coefficients.
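The two purely graph-based features can be read off the public structure without touching the data. A minimal sketch, assuming the DAG is given as a dictionary from each node to its list of children (a representation chosen here for illustration):

```python
from functools import lru_cache

def heights_and_out_degrees(children):
    """children maps each node of the DAG to the list of its child nodes."""
    @lru_cache(maxsize=None)
    def height(node):
        # Length of the longest directed path from `node` to a leaf.
        kids = children[node]
        return 0 if not kids else 1 + max(height(k) for k in kids)

    h = {v: height(v) for v in children}
    o = {v: len(children[v]) for v in children}
    return h, o

# Hypothetical graph: A -> C, B -> C, C -> D.
graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
h, o = heights_and_out_degrees(graph)
```

Since these quantities depend only on \mathcal{G}, computing them consumes no privacy budget.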

### 3.4 Privacy Analysis

###### Theorem 3.1.

The proposed algorithm (Algorithm 2) for learning the parameters of a directed graphical model over fully observed data is \epsilon^{B}-differentially private.

###### Proof.

The sensitivity of counting queries is 1. Hence, the computation of the noisy tables \tilde{T}_{i} (Alg. 1, Lines 2-3) is a straightforward application of the Laplace mechanism (Section A.1.2). This together with Lemma 2.1 makes the computation of \tilde{T}_{i} \epsilon^{I}-DP. The subsequent computation of the optimal privacy budget allocation \mathpzc{E}^{*} is a post-processing operation on \tilde{T}_{i} and hence, by Theorem A.4, is still \epsilon^{I}-DP. The final parameter computation is (\epsilon^{B}-\epsilon^{I})-DP. Thus by the theorem of sequential composition (Theorem A.3), Algorithm 2 is \epsilon^{B}-DP. The DGM thus learned can be released publicly and any inference query run on it will still be \epsilon^{B}-DP (from Theorem A.4). ∎

## 4 Error Analysis for Inference Queries

As discussed in Section 3, our optimization objective minimizes a weighted sum of the parameter errors. To understand how the error propagates from the parameters to the inference queries, we present two general results bounding the error of a sum-product term of the VE algorithm, given the errors in the factors.

###### Theorem 4.1.

[Lower Bound] For a DGM \mathcal{N}, for any sum-product term of the form \small\boldsymbol{\phi}_{\mathpzc{A}}=\sum_{x}\prod_{i=1}^{t}\phi_{i},t\in\{2,\cdots,\eta\} in the VE algorithm, we have

\small\delta_{\phi_{\mathpzc{A}}}\geq\sqrt{\eta-1}\cdot\delta^{min}_{\phi_{i}[a,x]}(\phi^{min}_{i}[a,x])^{\eta-2} | (11) |

where X is the attribute being eliminated, \small Attr(\phi) is the set of attributes in \phi, \small\mathpzc{A}=\bigcup_{\phi_{i}}\{Attr(\phi_{i})\}/X, x\in dom(X), a\in dom(\mathpzc{A}), \delta^{min}_{\phi_{i}[a,x]}=\min_{i,a,x}\{\delta_{\phi_{i}[a,x]}\}, \phi^{min}_{i}[a,x]=\min_{i,a,x}\{\phi_{i}[a,x]\} and \eta=\max_{X_{i}}\{\text{in-degree}(X_{i})+\text{out-degree}(X_{i})\}+1.

###### Theorem 4.2.

[Upper Bound] For a DGM \mathcal{N}, for any sum-product term of the form \small\boldsymbol{\phi}_{\mathpzc{A}}=\sum_{x}\prod_{i=1}^{t}\phi_{i}, t\in\{2,\cdots,n\} in the VE algorithm with the optimal elimination order, we have

\small\delta_{\phi_{\mathpzc{A}}}\leq 2\cdot\eta\cdot d^{\kappa}\delta^{max}_{\phi_{i}[a,x]} |

where X is the attribute being eliminated, \kappa is the treewidth of \mathcal{G}, d is the maximum domain size of an attribute, \small Attr(\phi) is the set of attributes in \phi, \small\eta=\max_{X_{i}}\{\text{in-degree}(X_{i})+\text{out-degree}(X_{i})\}+1, \small\mathpzc{A}=\bigcup_{i}^{t}\{Attr(\phi_{i})\}/X, a\in dom(\mathpzc{A}), x\in dom(X) and \small\delta^{max}_{\phi_{i}[a,x]}=\max_{i,a,x}\{\delta_{\phi_{i}[a,x]}\}.

For proving the lower bound, we introduce a specific instance of the DGM based on Lemma A.1. For the upper bound, with the optimal elimination order of the VE algorithm, the maximum error has an exponential dependency on the treewidth \kappa. This is very intuitive as even the complexity of the VE algorithm has the same dependency on \kappa. The answer of a marginal inference query is the factor generated from the last sum-product term. Also, since the initial set of \phi_{i}s for the first sum-product term computation are the actual parameters of the DGM, all the errors in the subsequent intermediate factors and hence the inference query itself can be bounded by functions of parameter errors using the above theorems.
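For a feel of the gap between the two bounds, the right-hand sides of Theorems 4.1 and 4.2 can be evaluated for given graph constants. The sketch below just transcribes the two expressions; the function names and the input values are illustrative, not taken from the experiments.

```python
import math

def lower_bound(eta, delta_min, phi_min):
    """Right-hand side of Theorem 4.1: sqrt(eta - 1) * delta_min * phi_min^(eta - 2)."""
    return math.sqrt(eta - 1) * delta_min * phi_min ** (eta - 2)

def upper_bound(eta, d, kappa, delta_max):
    """Right-hand side of Theorem 4.2: 2 * eta * d^kappa * delta_max."""
    return 2 * eta * d ** kappa * delta_max

# Hypothetical binary DGM with maximum degree 2 (eta = 3) and treewidth 2.
lb = lower_bound(eta=3, delta_min=0.01, phi_min=0.5)
ub = upper_bound(eta=3, d=2, kappa=2, delta_max=0.01)
ub_wider = upper_bound(eta=3, d=2, kappa=4, delta_max=0.01)
```

Raising the treewidth from 2 to 4 multiplies the upper bound by d^{2}, which illustrates the exponential dependency on \kappa.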

## 5 Evaluation

We now evaluate the utility of the DGM learned under differential privacy using our algorithm. Specifically, we study the following three questions: (1) Does our scheme lead to low error approximation of the DGM parameters? (2) Does our scheme result in low error inference query responses? (3) How does our scheme fare against data-independent approaches?

Evaluation Highlights: First, focusing on the parameters of the DGM, we find that our scheme achieves low L1 error (at most 0.2 for \epsilon=1) and low KL divergence (at most 0.13 for \epsilon=1) across all test data sets. Second, we find that for marginal and conditional inferences, our scheme provides low L1 error and KL divergence (both around 0.05 at max for \epsilon=1) for all test data sets. Our scheme also provides high accuracy for MAP queries (93.8\% accuracy for \epsilon=1 averaged over all test data sets). Finally, our scheme achieves strictly better utility than the data-independent baseline; our scheme only requires a privacy budget of \epsilon=1.0 to yield the same utility that the data-independent baseline achieves with \epsilon=3.0.

### 5.1 Experimental Setup

Data sets: We evaluate our proposed scheme on four benchmark DGMs [8] –

(1) Asia: Number of nodes – 8; Number of arcs – 8; Number of parameters – 18

(2) Sachs: Number of nodes – 11; Number of arcs – 17; Number of parameters – 178

(3) Child: Number of nodes – 20; Number of arcs – 25; Number of parameters – 230

(4) Alarm: Number of nodes – 37; Number of arcs – 46; Number of parameters – 509

For all four DGMs, the evaluation is carried out on corresponding synthetic data sets [12, 13] with 10,000 records each.

Metrics: For conditional and marginal inference queries we compute the following two metrics: L1-error, \delta_{L1}=\sum_{x,y}|P[x|y]-\tilde{P}[x|y]| and KL divergence, D_{KL}=\sum_{x,y}\tilde{P}[x|y]\ln\Big{(}\frac{\tilde{P}[x|y]}{P[x|y]}\Big{)} where P[x|y] denotes either a true CPD of the DGM (parameter) or a true marginal/conditional inference query response and \tilde{P}[x|y] is the corresponding noisy estimate obtained from our proposed scheme. For answering MAP inferences, we compute \rho=\frac{\text{\# Correct answers}}{\text{\# Total runs}}.
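Both error metrics are straightforward to compute once the true and noisy distributions are stored under the same keys. A minimal sketch follows; the dictionary representation and function names are assumptions for illustration, and the KL term skips entries where the noisy probability is zero and assumes the true probability is positive elsewhere.

```python
import math

def l1_error(p, p_noisy):
    """delta_L1 = sum over keys of |P - P~|."""
    return sum(abs(p[k] - p_noisy[k]) for k in p)

def kl_divergence(p, p_noisy):
    """D_KL = sum over keys of P~ * ln(P~ / P); zero-probability terms contribute 0."""
    return sum(q * math.log(q / p[k]) for k, q in p_noisy.items() if q > 0)

# Hypothetical true and noisy distributions over a binary variable.
p_true = {0: 0.5, 1: 0.5}
p_tilde = {0: 0.6, 1: 0.4}
err_l1 = l1_error(p_true, p_tilde)
err_kl = kl_divergence(p_true, p_tilde)
```

For identical distributions both metrics evaluate to zero.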

Setup: We evaluate each data set on 20 random inference queries (10 marginal inference, 10 conditional inference) and report mean error over 10 runs. For MAP queries, we run 20 random queries and report the mean result over 10 runs. The queries are of the form P[X|Y] where attribute subsets X and Y are varied from being singletons up to the full attribute set. We compare our results with a standard data-independent baseline (denoted by D-Ind) [63, 61] which corresponds to executing Algorithm 1 on the entire input data set \mathcal{D} and the privacy budget array \mathpzc{E}=[\frac{\epsilon^{B}}{n},\cdots,\frac{\epsilon^{B}}{n}]. All the experiments have been implemented in Python and we set \epsilon^{I}=0.1\cdot\epsilon^{B},\beta=0.1.

### 5.2 Experimental Results

Figure 1 shows the mean \delta_{L1} and D_{KL} for noisy parameters and marginal and conditional inferences for the data sets Sachs and Child. The main observation is that our scheme achieves strictly lower error than that of D-Ind; specifically, our scheme only requires a privacy budget of \epsilon=1.0 to yield the same utility that D-Ind achieves with \epsilon=3.0. In most practical scenarios, the value of \epsilon typically does not exceed 3 [29]. In Table 1, we present our experimental results for MAP queries. We see that our scheme achieves higher accuracy. For example, our scheme provides an accuracy of at least 86\% while D-Ind achieves 81\% accuracy for \epsilon=1. Finally, given a marginal inference query \mathcal{Q}, we compute the scale normalized error in \mathcal{Q} as \mu=\frac{\delta_{L1}[\mathcal{Q}]-LB}{UB-LB} where UB and LB are the upper and lower bound respectively computed using Theorem 4.2 and Theorem 4.1 (UB and LB are computed separately for each run of the experiment from their respective empirical parameter errors). Clearly, the lower the value of \mu, the closer the error is to the lower bound, and vice versa. We report the mean value of \mu for 20 random inference queries (marginal and conditional) for \epsilon=1 in Table 2. We observe that the errors are closer to their respective lower bounds. This is more prominent for the errors obtained from our data-dependent scheme than those of D-Ind.

Table 1: MAP query accuracy \rho.

\epsilon | Asia, D-Ind Scheme | Asia, Our Scheme | Sachs, D-Ind Scheme | Sachs, Our Scheme | Child, D-Ind Scheme | Child, Our Scheme | Alarm, D-Ind Scheme | Alarm, Our Scheme
---|---|---|---|---|---|---|---|---
1 | 0.88 | 1 | 0.81 | 0.86 | 0.79 | 0.93 | 0.89 | 0.95
1.5 | 0.93 | 1 | 0.87 | 0.93 | 0.83 | 0.95 | 0.92 | 0.98
2 | 1 | 1 | 0.92 | 0.98 | 0.89 | 0.97 | 0.95 | 1
2.5 | 1 | 1 | 0.96 | 1 | 1 | 1 | 1 | 1
3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1

Table 2: Mean scale normalized error \mu for \epsilon=1.

Scheme | Asia | Sachs | Child | Alarm
---|---|---|---|---
D-Ind Scheme | 0.008 | 0.065 | 0.0035 | 0.0046
Our Scheme | 0.0035 | 0.04 | 0.0014 | 0.0012

Thus we conclude that the non-uniform budget allocation in our data-dependent scheme gives better utility than a uniform budget allocation. For example, for a total privacy budget \epsilon^{B}=1 in DGM Sachs, our scheme assigns the highest budget (\epsilon=0.1) to node "PKC", which is the root with 5 child nodes, and the least budget (\epsilon=0.004) to "Jnk", which is a leaf node.

## 6 Related Work

In this section we review related literature. There has been a steadily growing amount of work on differentially private machine learning models over the last couple of years. We list some of the most recent work in this line (the list is not exhaustive). [1, 53, 3] address the problem of differentially private SGD. The authors of [41] present an algorithm for differentially private expectation maximization. In [36] the problem of differentially private M-estimators is addressed. Algorithms for performing empirical risk minimization under differential privacy have been proposed in [49, 9]. In [50] two differentially private subspace clustering algorithms are proposed.

There has been a fair amount of work in differentially private Bayesian inference and related notions [15, 51, 21, 63, 22, 7, 28, 60, 45, 40, 30, 5, 6, 19, 61]. In [28] the authors present a solution for DP Bayesian learning in a distributed setting, where each party holds only a single sample or a few samples of the data. In [19] the authors show that a data-dependent prior learnt under \epsilon-DP yields a valid PAC-Bayes bound. The authors in [52] show that probabilistic inference over differentially private measurements to derive posterior distributions over the data sets and model parameters can potentially improve accuracy. An algorithm to learn an unknown probability distribution over a discrete population from random samples under \epsilon-DP is presented in [14]. In [7] the authors present a method for private Bayesian inference in exponential families that learns from sufficient statistics. The authors of [51] and [15] show that posterior sampling gives differential privacy "for free" under certain assumptions. In [21] the authors show that a Laplace-mechanism-based alternative to "One Posterior Sample" is as asymptotically efficient as non-private posterior inference, under general assumptions. A Rényi differentially private posterior sampling algorithm is presented in [22]. [60] proposes a differentially private Naive Bayes classification algorithm for data streams. [63] presents algorithms for private Bayesian inference on probabilistic graphical models. In [40], the authors introduce a general privacy-preserving framework for Variational Bayes. An expressive framework for writing and verifying differentially private Bayesian machine learning algorithms is presented in [5]. The problem of learning discrete, undirected graphical models in a differentially private way is studied in [6]. [45] presents a general method for privacy-preserving Bayesian inference in Poisson factorization.

In [63] the authors propose algorithms for private Bayesian inference on graphical models. However, their proposed solution does not add data-dependent noise. In fact, their proposed algorithms (Algorithm 1 and Algorithm 2 in [63]) are essentially the same in spirit as our baseline solution D-Ind. Moreover, some proposals from [63] can be combined with D-Ind; for example, to ensure mutual consistency, [63] adds Laplace noise in the Fourier domain while D-Ind uses techniques of [26]. D-Ind is also identical, apart from an additional consistency step, to an algorithm used in [61], which uses DGMs to generate high-dimensional data.

A number of data-dependent differentially private algorithms have been proposed in the past few years. [2, 56, 62, 55] outline data-dependent mechanisms for publishing histograms. In [10] the authors construct an estimate of the dataset by building differentially private kd-trees. MWEM [25] derives an estimate of the dataset iteratively via multiplicative weight updates. In [37] differential privacy is achieved by adding data- and workload-dependent noise. [34] presents a data-dependent differentially private algorithm selection technique. [24, 18] present two general data-dependent differentially private mechanisms. Certain data-independent mechanisms attempt to find a better set of measurements in support of a given workload. One of the most prominent techniques is the matrix mechanism framework [58, 38], which formalizes the measurement selection problem as a rank-constrained SDP. Another popular approach is to employ a hierarchical strategy [27, 11, 54]. [57, 4, 16, 23, 48, 26] propose techniques for marginal table release.

## 7 Conclusion

In this paper we have proposed an algorithm for differentially private learning of the parameters of a DGM with a publicly known graph structure over fully observed data. The noise added is customized to the private input data set as well as the public graph structure of the DGM. To the best of our knowledge, we propose the first explicit data-dependent privacy budget allocation mechanism for DGMs. Our solution achieves strictly higher utility than that of a standard data-independent approach; our solution requires at least 3\times smaller privacy budget to achieve the same or higher utility.

## References

- [1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 308–318, New York, NY, USA, 2016. ACM.
- [2] G. Acs, C. Castelluccia, and R. Chen. Differentially private histogram publishing through lossy compression. In 2012 IEEE 12th International Conference on Data Mining, pages 1–10, Dec 2012.
- [3] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpsgd: Communication-efficient and differentially-private distributed sgd. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7564–7575. Curran Associates, Inc., 2018.
- [4] Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’07, pages 273–282, New York, NY, USA, 2007. ACM.
- [5] Gilles Barthe, Gian Pietro Farina, Marco Gaboardi, Emilio Jesus Gallego Arias, Andy Gordon, Justin Hsu, and Pierre-Yves Strub. Differentially private bayesian programming. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 68–79, New York, NY, USA, 2016. ACM.
- [6] Garrett Bernstein, Ryan McKenna, Tao Sun, Daniel Sheldon, Michael Hay, and Gerome Miklau. Differentially private learning of undirected graphical models using collective graphical models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 478–487. JMLR.org, 2017.
- [7] Garrett Bernstein and Daniel R Sheldon. Differentially private bayesian inference for exponential families. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2919–2929. Curran Associates, Inc., 2018.
- [8] http://www.bnlearn.com/bnrepository/.
- [9] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069–1109, July 2011.
- [10] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. Differentially private spatial decompositions. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, pages 20–31, Washington, DC, USA, 2012. IEEE Computer Society.
- [11] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. Differentially private spatial decompositions. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE ’12, pages 20–31, Washington, DC, USA, 2012. IEEE Computer Society.
- [12] https://github.com/albertofranzin/data-thesis.
- [13] https://www.ccd.pitt.edu/wiki/index.php/data_repository.
- [14] Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of structured discrete distributions. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2566–2574. Curran Associates, Inc., 2015.
- [15] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin I. P. Rubinstein. Robust and private bayesian inference. In Peter Auer, Alexander Clark, Thomas Zeugmann, and Sandra Zilles, editors, Algorithmic Learning Theory, pages 291–305, Cham, 2014. Springer International Publishing.
- [16] Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li. Differentially private data cubes: Optimizing noise sources and consistency. In Test, pages 217–228, 01 2011.
- [17] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, August 2014.
- [18] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS ’10, pages 51–60, Washington, DC, USA, 2010. IEEE Computer Society.
- [19] Gintare Karolina Dziugaite and Daniel M Roy. Data-dependent pac-bayes priors via differential privacy. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8430–8441. Curran Associates, Inc., 2018.
- [20] https://en.wikipedia.org/wiki/propagation_of_uncertainty.
- [21] James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 192–201, Arlington, Virginia, United States, 2016. AUAI Press.
- [22] Joseph Geumlek, Shuang Song, and Kamalika Chaudhuri. Rényi differential privacy mechanisms for posterior sampling, 2017.
- [23] Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC ’11, pages 803–812, New York, NY, USA, 2011. ACM.
- [24] Anupam Gupta, Aaron Roth, and Jonathan Ullman. Iterative constructions and private data release. In Proceedings of the 9th International Conference on Theory of Cryptography, TCC’12, pages 339–356, Berlin, Heidelberg, 2012. Springer-Verlag.
- [25] Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS’12, pages 2339–2347, USA, 2012. Curran Associates Inc.
- [26] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021–1032, Sep 2010.
- [27] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow., 3(1-2):1021–1032, September 2010.
- [28] Mikko Heikkilä, Eemil Lagerspetz, Samuel Kaski, Kana Shimizu, Sasu Tarkoma, and Antti Honkela. Differentially private bayesian learning on distributed data. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3226–3235. Curran Associates, Inc., 2017.
- [29] Justin Hsu, Marco Gaboardi, Andreas Haeberlen, Sanjeev Khanna, Arjun Narayan, Benjamin C. Pierce, and Aaron Roth. Differential privacy: An economic method for choosing epsilon. 2014 IEEE 27th Computer Security Foundations Symposium, Jul 2014.
- [30] Joonas Jälkö, Antti Honkela, and Onur Dikmen. Differentially private variational inference for non-conjugate models. CoRR, abs/1610.08749, 2016.
- [31] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, June 2011.
- [32] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.
- [33] Ios Kotsogiannis, Ashwin Machanavajjhala, Michael Hay, and Gerome Miklau. Pythia: Data dependent differentially private algorithm selection. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1323–1337, New York, NY, USA, 2017. ACM.
- [34] Ios Kotsogiannis, Ashwin Machanavajjhala, Michael Hay, and Gerome Miklau. Pythia: Data dependent differentially private algorithm selection. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1323–1337, New York, NY, USA, 2017. ACM.
- [35] K. B. Laskey. Sensitivity analysis for probability assessments in bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, 25(6):901–909, June 1995.
- [36] Jing Lei. Differentially private m-estimators. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 361–369. Curran Associates, Inc., 2011.
- [37] Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. A data- and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment, 7(5):341–352, Jan 2014.
- [38] Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, and Andrew McGregor. Optimizing linear counting queries under differential privacy. In Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’10, pages 123–134, New York, NY, USA, 2010. ACM.
- [39] Ninghui Li, Wahbeh H. Qardaji, and Dong Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In 7th ACM Symposium on Information, Computer and Communications Security, ASIACCS ’12, Seoul, Korea, May 2-4, 2012, pages 32–33, 2012.
- [40] Mijung Park, James R. Foulds, Kamalika Chaudhuri, and Max Welling. Variational bayes in private settings (vips). CoRR, abs/1611.00340, 2016.
- [41] Mijung Park, Jimmy Foulds, Kamalika Chaudhuri, and Max Welling. Dp-em: Differentially private expectation maximization, 2016.
- [42] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
- [43] Judea Pearl. Graphical Models for Probabilistic and Causal Reasoning, pages 367–389. Springer Netherlands, Dordrecht, 1998.
- [44] Wahbeh Qardaji, Weining Yang, and Ninghui Li. Priview: Practical differentially private release of marginal contingency tables. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 1435–1446, New York, NY, USA, 2014. ACM.
- [45] Aaron Schein, Zhiwei Steven Wu, Mingyuan Zhou, and Hanna M. Wallach. Locally private bayesian inference for count models. CoRR, abs/1803.08471, 2018.
- [46] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18, May 2017.
- [47] https://adamdsmith.wordpress.com/2009/09/02/sample-secrecy/.
- [48] Justin Thaler, Jonathan Ullman, and Salil Vadhan. Faster algorithms for privately releasing marginals. Lecture Notes in Computer Science, pages 810–821, 2012.
- [49] Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2722–2731. Curran Associates, Inc., 2017.
- [50] Yining Wang, Yu-Xiang Wang, and Aarti Singh. Differentially private subspace clustering. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 1000–1008, Cambridge, MA, USA, 2015. MIT Press.
- [51] Yu-Xiang Wang, Stephen E. Fienberg, and Alexander J. Smola. Privacy for free: Posterior sampling and stochastic gradient monte carlo. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2493–2502. JMLR.org, 2015.
- [52] Oliver Williams and Frank Mcsherry. Probabilistic inference and differential privacy. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2451–2459. Curran Associates, Inc., 2010.
- [53] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. Proceedings of the 2017 ACM International Conference on Management of Data - SIGMOD ’17, 2017.
- [54] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. Differential privacy via wavelet transforms. IEEE Trans. on Knowl. and Data Eng., 23(8):1200–1214, August 2011.
- [55] Y. Xiao, J. Gardner, and L. Xiong. Dpcube: Releasing differentially private data cubes for health information. In 2012 IEEE 28th International Conference on Data Engineering, pages 1305–1308, April 2012.
- [56] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu. Differentially private histogram publication. In 2012 IEEE 28th International Conference on Data Engineering, pages 32–43, April 2012.
- [57] Grigory Yaroslavtsev, Cecilia M. Procopiuc, Graham Cormode, and Divesh Srivastava. Accurate and efficient private release of datacubes and contingency tables. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013), ICDE ’13, pages 745–756, Washington, DC, USA, 2013. IEEE Computer Society.
- [58] Ganzhao Yuan, Zhenjie Zhang, Marianne Winslett, Xiaokui Xiao, Yin Yang, and Zhifeng Hao. Low-rank mechanism: optimizing batch queries under differential privacy. Proceedings of the VLDB Endowment, 5(11):1352–1363, Jul 2012.
- [59] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization, 2016.
- [60] G. Zhang and S. Li. Research on differentially private bayesian classification algorithm for data streams. In 2019 IEEE 4th International Conference on Big Data Analytics (ICBDA), pages 14–20, March 2019.
- [61] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. Privbayes: Private data release via bayesian networks. ACM Trans. Database Syst., 42(4):25:1–25:41, October 2017.
- [62] Xiaojian Zhang, Rui Chen, Jianliang Xu, Xiaofeng Meng, and Yingtao Xie. Towards accurate histogram publication under differential privacy. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 587–595, 2014.
- [63] Zuhe Zhang, Benjamin I. P. Rubinstein, and Christos Dimitrakakis. On the differential privacy of bayesian inference. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2365–2371. AAAI Press, 2016.

## Appendix A Appendix

### A.1 Background Cntd.

#### A.1.1 Directed Graphical Models Cntd.

###### Definition A.1 (Markov Blanket).

The Markov blanket for a node X in a graphical model, denoted by \mathpzc{P}(X), is the set of all nodes whose knowledge is sufficient to predict X and hence shields X from the rest of the network [42]. In a DGM, the Markov blanket of a node consists of its child nodes, parent nodes and the parents of its child nodes.

For example, in Figure 2, \mathpzc{P}(D)=\{C,E,F\}.

#### Learning Cntd.

As mentioned before in Section 2, for a fully observed DGM, the parameters (CPDs) are computed via maximum likelihood estimation (MLE). Let the data set be \mathcal{D}=\{x^{1},x^{2},\cdots,x^{m}\} with m i.i.d. records with attribute set \langle X_{1},X_{2},\cdots,X_{n}\rangle. The likelihood is then given by

\mathcal{L}(\Theta,\mathcal{D})=\prod_{i=1}^{n}\prod_{j=1}^{m}\Theta_{x_{i}^{j}|x_{pa_{i}}^{j}} \qquad (12)

where \Theta_{x_{i}^{j}|x_{pa_{i}}^{j}} represents the probability that X_{i}=x_{i}^{j} given X_{pa_{i}}=x_{pa_{i}}^{j}.

Taking logs and rearranging, this reduces to

\log\mathcal{L}(\Theta,\mathcal{D})=\sum_{i=1}^{n}\sum_{x_{pa_{i}}}\sum_{x_{i}}C[x_{i},x_{pa_{i}}]\cdot\log\Theta[x_{i}|x_{pa_{i}}] \qquad (13)

where C[x_{i},x_{pa_{i}}] denotes the number of records in data set \mathcal{D} such that X_{i}=x_{i},X_{pa_{i}}=x_{pa_{i}}. Thus, maximization of the (log) likelihood function decomposes into separate maximizations for the local conditional distributions which results in the closed form solution

\Theta_{x_{i}|x_{pa_{i}}}=\frac{C[x_{i},x_{pa_{i}}]}{C[x_{pa_{i}}]} \qquad (14)
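The closed-form estimate amounts to dividing joint counts by parent counts. The following is a minimal sketch of that computation for a single node. The record layout (a list of dicts) and the function name `mle_cpd` are assumptions made for illustration, and the toy records are made up.

```python
from collections import Counter

def mle_cpd(records, child, parents):
    """Estimate Theta[x_i | x_pa_i] = C[x_i, x_pa_i] / C[x_pa_i] from fully observed records.

    records: list of dicts mapping attribute name -> value (hypothetical layout).
    Returns a dict keyed by (child_value, parent_values_tuple).
    """
    joint = Counter()          # C[x_i, x_pa_i]
    parent_counts = Counter()  # C[x_pa_i]
    for r in records:
        pa = tuple(r[p] for p in parents)
        joint[(r[child], pa)] += 1
        parent_counts[pa] += 1
    return {key: c / parent_counts[key[1]] for key, c in joint.items()}

# Toy data set with a single edge A -> C.
records = [
    {"A": 0, "C": 0},
    {"A": 0, "C": 1},
    {"A": 0, "C": 1},
    {"A": 1, "C": 1},
]
cpd = mle_cpd(records, child="C", parents=["A"])
print(cpd)
```

Value combinations that never occur in the data are simply absent from the returned dict; how to treat them (for example, by smoothing) is outside the scope of this sketch.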

#### Inference Cntd.

There are three types of inference queries in general:

(1) Marginal inference: This is used to answer queries of the type "what is the probability of a given variable if all others are marginalized". An example marginal inference query is P[X_{n}=e]=\sum_{X_{1}}\sum_{X_{2}}\cdots\sum_{X_{n-1}}P[X_{1},X_{2},\cdots,X_{n}=e].

(2) Conditional Inference: This type of query answers the probability distribution of some variable conditioned on some evidence e. An example conditional inference query is P[X_{1}|X_{n}=e]=\frac{P[X_{1},X_{n}=e]}{P[X_{n}=e]}.

(3) Maximum a posteriori (MAP) inference: This type of query asks for the most likely assignment of variables. An example of MAP query is

\max_{X_{1},\cdots,X_{n-1}}\{P[X_{1},\cdots,X_{n-1},X_{n}=e]\}

Variable Elimination Algorithm (VE): The complete VE algorithm is given by Algorithm 3. The basic idea of the variable elimination algorithm is that we "eliminate" one variable at a time following a predefined order \prec over the nodes of the graph. Let \Phi denote a set of probability factors which is initialized as the set of all CPDs of the DGM and Z denote the variable to be eliminated. For the elimination step, firstly all the probability factors involving the variable to be eliminated, Z are removed from \Phi and multiplied together to generate a new product factor. Next, the variable Z is summed out from this combined factor generating a new factor that is entered into \Phi. Thus the VE algorithm essentially involves repeated computation of a sum-product task of the form

\boldsymbol{\phi}=\sum_{Z}\prod_{\phi\in\Phi}\phi \qquad (15)
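A single elimination step of this kind can be sketched with NumPy as follows. This is an illustrative sketch, not Algorithm 3 itself: the representation of a factor as a `(variables, table)` pair and the helper names `multiply`, `sum_out` and `eliminate` are assumptions, and the two-node chain A -> B with its probabilities is a made-up example.

```python
import numpy as np

def multiply(f, g):
    """Product of two factors, each given as a (variables, table) pair."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = tuple(f_vars) + tuple(v for v in g_vars if v not in f_vars)
    letters = {v: chr(ord("a") + i) for i, v in enumerate(out_vars)}
    spec = "{},{}->{}".format(
        "".join(letters[v] for v in f_vars),
        "".join(letters[v] for v in g_vars),
        "".join(letters[v] for v in out_vars),
    )
    return out_vars, np.einsum(spec, f_tab, g_tab)

def sum_out(f, z):
    """Sum the variable z out of factor f."""
    f_vars, f_tab = f
    axis = tuple(f_vars).index(z)
    return tuple(v for v in f_vars if v != z), f_tab.sum(axis=axis)

def eliminate(factors, z):
    """One VE step: multiply all factors mentioning z, sum z out, put the result back."""
    involved = [f for f in factors if z in f[0]]
    rest = [f for f in factors if z not in f[0]]
    if not involved:
        return list(factors)
    product = involved[0]
    for f in involved[1:]:
        product = multiply(product, f)
    return rest + [sum_out(product, z)]

# Toy chain A -> B with binary attributes; tables are indexed in the order of `variables`.
p_a = (("A",), np.array([0.6, 0.4]))
p_b_given_a = (("A", "B"), np.array([[0.9, 0.1], [0.2, 0.8]]))
remaining = eliminate([p_a, p_b_given_a], "A")
print(remaining)  # a single factor over B holding the marginal P[B]
```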

The complexity of the VE algorithm is defined by the size of the largest factor. Here we state two lemmas regarding the intermediate factors \boldsymbol{\phi} which will be used in Section A.3.

###### Lemma A.1.

###### Lemma A.2.

The size of the largest intermediary factor generated as a result of running of the VE algorithm on a DGM is at least equal to the treewidth of the graph [32].

###### Corollary.

The complexity of the VE algorithm with the optimal order of elimination depends on the treewidth of the graph.

#### A.1.2 Differential Privacy Cntd.

When applied multiple times, the differential privacy guarantee degrades gracefully as follows.

###### Theorem A.3 (Sequential Composition).

If \mathcal{A}_{1} and \mathcal{A}_{2} are \epsilon_{1}-DP and \epsilon_{2}-DP algorithms that use independent randomness, then releasing the outputs (\mathcal{A}_{1}(D),\mathcal{A}_{2}(D)) on database D satisfies (\epsilon_{1}+\epsilon_{2})-DP.
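As a small illustration of how sequential composition is used for budget accounting, the sketch below splits a total budget uniformly over the nodes of a DGM, as the data-independent baseline does with \epsilon^{B}/n per node, and checks that the per-node budgets add up to the total. The function names are hypothetical and the numbers are chosen only for the example.

```python
def uniform_budget(total_epsilon, n_nodes):
    """Data-independent split: every node receives total_epsilon / n_nodes."""
    return [total_epsilon / n_nodes] * n_nodes

def composed_epsilon(budgets):
    """Sequential composition: the overall guarantee is the sum of the per-release budgets."""
    return sum(budgets)

budgets = uniform_budget(1.0, 8)  # e.g. a DGM with 8 nodes and total budget 1
total = composed_epsilon(budgets)
print(budgets[0], total)
```

Any other allocation whose entries sum to the same total satisfies the same overall guarantee, which is what leaves room for a non-uniform, data-dependent split.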

Another important result for differential privacy is that any post-processing computation performed on the noisy output of a differentially private algorithm does not cause any loss in privacy.

###### Theorem A.4 (Post-Processing).

Let \mathcal{A}:D\mapsto R be a randomized algorithm that is \epsilon-DP. Let f:R\mapsto R^{\prime} be an arbitrary randomized mapping. Then f\circ\mathcal{A}:D\mapsto R^{\prime} is \epsilon-DP.

Laplace Mechanism: In the Laplace mechanism, in order to publish f(D) where f:D\mapsto R, an \epsilon-differentially private mechanism \mathcal{A} publishes f(D)+Lap\Big(\frac{\Delta f}{\epsilon}\Big), where \Delta f=\max_{D,D^{\prime}}||f(D)-f(D^{\prime})||_{1}, with the maximum taken over neighboring data sets D and D^{\prime}, is known as the sensitivity of the query. The pdf of Lap(b) is given by \mathbf{f}(x)=\frac{1}{2b}e^{-\frac{|x|}{b}}. The sensitivity of the function f captures the magnitude by which a single individual’s data can change the function f in the worst case. Therefore, intuitively, it captures the uncertainty in the response that we must introduce in order to hide the participation of a single individual. For counting queries the sensitivity is 1.
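A minimal sketch of the Laplace mechanism for a single counting query is given below, using NumPy's random generator. The function name `laplace_mechanism`, the seed, and the example count are assumptions for illustration only.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release true_answer + Lap(sensitivity / epsilon) for a scalar query."""
    scale = sensitivity / epsilon  # b = Delta f / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)  # fixed seed only to make the example repeatable
true_count = 100                # a counting query, so the sensitivity is 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy_count)
```

With \epsilon=0.5 the noise scale is b=2, so the released count deviates from the true count by 2 in expected absolute value; a smaller \epsilon gives a larger scale and therefore a noisier answer.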

### A.2 Data-Dependent Differentially Private Parameter Learning for DGMs Cntd.

#### A.2.1 Consistency between noisy marginal tables

The objective of this step is to input the set of noisy marginal tables \tilde{M}_{i} and compute perturbed versions of these tables that are mutually consistent.

###### Definition A.2.

Two noisy marginal tables \tilde{M}_{i} and \tilde{M}_{j} are defined to be consistent (denoted by \equiv) if and only if the marginal table over attributes in Attr(\tilde{M}_{i})\cap Attr(\tilde{M}_{j}) reconstructed from \tilde{M}_{i} is exactly the same as the one reconstructed from \tilde{M}_{j}, that is,

\tilde{M}_{i}[Attr(\tilde{M}_{i})\cap Attr(\tilde{M}_{j})]\equiv\tilde{M}_{j}[Attr(\tilde{M}_{i})\cap Attr(\tilde{M}_{j})] \qquad (16)

where Attr(M) is the set of attributes on which marginal table M is defined.

Mutual Consistency on a Set of Attributes:

Assume a set of tables \{\tilde{M}_{i},\cdots,\tilde{M}_{j}\} and let A=Attr(\tilde{M}_{i})\cap\cdots\cap Attr(\tilde{M}_{j}). Mutual consistency, i.e., \tilde{M}_{i}[A]\equiv\cdots\equiv\tilde{M}_{j}[A] is achieved as follows:

(1) First compute the best approximation for the marginal table \tilde{M}_{A} for the attribute set A as follows

\tilde{M}_{A}[A^{\prime}]=\frac{1}{\sum_{t=i}^{j}\epsilon_{t}}\sum_{t=i}^{j}\epsilon_{t}\cdot\tilde{M}_{t}[A^{\prime}],\quad A^{\prime}\in A \qquad (17)

(2) Update all \tilde{M}_{t}s to be consistent with \tilde{M}_{A}. Any counting query c is now answered as

\tilde{M}_{t}(c)=\tilde{M}_{t}(c)+\frac{|dom(A)|}{|dom(Attr(\tilde{M}_{t}))|}\big(\tilde{M}_{A}(a)-\tilde{M}_{t}(a)\big) \qquad (18)

where a is the query c restricted to attributes in A and \tilde{M}_{t}(c) is the response of c on \tilde{M}_{t} .
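The two steps, Eq. (17) followed by Eq. (18), can be sketched with NumPy arrays as below. This is a hedged illustration under several assumptions: each table is a dense array with one axis per attribute, the shared attributes appear in the same relative order in every table, and the names `marginal` and `make_consistent` as well as the toy counts and budgets are invented for the example.

```python
import numpy as np

def marginal(table, attrs, keep):
    """Sum out every axis of `table` whose attribute is not in `keep`."""
    drop = tuple(i for i, a in enumerate(attrs) if a not in keep)
    return table.sum(axis=drop)

def make_consistent(tables, attr_lists, epsilons, shared):
    """Make all tables agree on the attribute set `shared` (Eq. (17), then Eq. (18))."""
    # Eq. (17): budget-weighted average of the projections onto `shared`.
    projections = [marginal(t, a, shared) for t, a in zip(tables, attr_lists)]
    target = sum(e * p for e, p in zip(epsilons, projections)) / sum(epsilons)
    updated = []
    for t, attrs, proj in zip(tables, attr_lists, projections):
        # Eq. (18): each cell gets |dom(A)| / |dom(Attr(M_t))| of the gap on its projection.
        correction = (target - proj) * proj.size / t.size
        # Reshape so the correction broadcasts along the non-shared axes.
        shape = [t.shape[i] if a in shared else 1 for i, a in enumerate(attrs)]
        updated.append(t + correction.reshape(shape))
    return updated

# Two toy noisy tables over (A, B) and (A, C) that disagree on the marginal of A.
m1 = np.array([[10.0, 20.0], [30.0, 40.0]])  # axes: A, B
m2 = np.array([[18.0, 16.0], [35.0, 31.0]])  # axes: A, C
new1, new2 = make_consistent(
    [m1, m2], [("A", "B"), ("A", "C")], epsilons=[0.3, 0.1], shared={"A"}
)
print(marginal(new1, ("A", "B"), {"A"}), marginal(new2, ("A", "C"), {"A"}))
```

In this toy run the table measured with the larger budget pulls the shared marginal towards its own values, and after the update both tables project to the same marginal over A.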

Overall Consistency:

(1) Take all sets of attributes that are the result of the intersection of some subset of \bigcup_{i=k+1}^{d}\{X_{i}\cup X_{pa_{i}}\}; these sets form a partial order under the subset relation.

(2) Obtain a topological sort of these sets, starting from the empty set.

(3) For each set A, find all tables that include A, and ensure that these tables are consistent on A.
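The ordering in steps (1) to (3) can be sketched as follows. This is an illustrative sketch, not the implementation used in the paper: it closes the tables' attribute sets under pairwise intersection, sorts the results by size (one valid topological order of the subset relation), and keeps only sets shared by at least two tables, since a set contained in a single table needs no reconciliation. The function names and the three toy attribute sets are assumptions.

```python
from itertools import combinations

def intersection_closure(attr_sets):
    """Close a family of attribute sets under pairwise intersection."""
    closed = {frozenset(s) for s in attr_sets}
    changed = True
    while changed:
        changed = False
        for s, t in combinations(list(closed), 2):
            inter = s & t
            if inter not in closed:
                closed.add(inter)
                changed = True
    return closed

def consistency_schedule(attr_sets):
    """Return (A, table indices) pairs so that each A comes after all of its proper subsets."""
    base = [frozenset(s) for s in attr_sets]
    ordered = sorted(intersection_closure(base), key=lambda s: (len(s), sorted(s)))
    schedule = []
    for a in ordered:
        tables = [i for i, s in enumerate(base) if a <= s]
        if len(tables) >= 2:  # only sets shared by several tables need reconciling
            schedule.append((a, tables))
    return schedule

# Toy marginal tables over {A,B,C}, {B,C,D} and {C,E}.
schedule = consistency_schedule([{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}])
for attrs, tables in schedule:
    print(sorted(attrs), tables)
```

Each entry of the schedule names an attribute set A and the tables to be made mutually consistent on it, processed from the smallest set upwards.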

### A.3 Error Bound Analysis Cntd.

In this section we present our results on the lower and upper bounds of the error in inference queries.

Preliminaries and Notations:

For the proofs, we use the following notations. Let X be the attribute that is being eliminated and let \mathpzc{A}=\bigcup_{\phi_{i}}Attr(\phi_{i})\setminus X where Attr(\phi) denotes the set of attributes in \phi. For some a\in dom(\mathpzc{A}), from the variable elimination algorithm (Section 8.1.3) for a sum-product term (Eq. (15)) we have

\boldsymbol{\phi}_{\mathpzc{A}}[a]=\sum_{x}\prod_{i=1}^{t}\phi_{i}[x,a] \qquad (19)

Let us assume that factor \phi[a,x] denotes that Value(Attr(\phi))\in\{a\} and X=x. Recall that after computing a sum-product task (given by Eq. (19)), for the variable elimination algorithm (Appendix Algorithm 3), we will be left with a factor term over the attribute set \mathpzc{A}. For example, if the elimination order for the variable elimination algorithm on our example DGM (Figure 2) is given by \prec=\{A,B,C,D,E,F\} and the attributes are binary valued, then the first sum-product task will be of the following form \mathpzc{A}=\{B,C\},dom(\mathpzc{A})=\{(0,0),(0,1),(1,0),(1,1)\} and the RHS \phi_{i}s in this case happen to be the true parameters of the DGM,

\boldsymbol{\phi}_{B,C}[0,0]=\Theta[A=0]\cdot\Theta[C=0|A=0,B=0]+\Theta[A=1]\cdot\Theta[C=0|A=1,B=0]

\boldsymbol{\phi}_{B,C}[0,1]=\Theta[A=0]\cdot\Theta[C=1|A=0,B=0]+\Theta[A=1]\cdot\Theta[C=1|A=1,B=0]

\boldsymbol{\phi}_{B,C}[1,0]=\Theta[A=0]\cdot\Theta[C=0|A=0,B=1]+\Theta[A=1]\cdot\Theta[C=0|A=1,B=1]

\boldsymbol{\phi}_{B,C}[1,1]=\Theta[A=0]\cdot\Theta[C=1|A=0,B=1]+\Theta[A=1]\cdot\Theta[C=1|A=1,B=1]

\boldsymbol{\phi}_{B,C}=[\boldsymbol{\phi}_{B,C}[0,0],\boldsymbol{\phi}_{B,C}[0,1],\boldsymbol{\phi}_{B,C}[1,0],\boldsymbol{\phi}_{B,C}[1,1]]

#### A.3.1 Lower Bound

###### Theorem A.5.

For a DGM \mathcal{N}, for any sum-product term of the form \boldsymbol{\phi}_{\mathpzc{A}}=\sum_{x}\prod_{i=1}^{t}\phi_{i},t\in\{2,\cdots,n\} in the VE algorithm, we have

\delta_{\phi_{\mathpzc{A}}}\geq\sqrt{\eta-1}\,\delta^{min}_{\phi_{i}[a,x]}(\phi^{min}[a,x])^{\eta-2} \qquad (20)

where X is the attribute being eliminated, Attr(\phi) denotes the set of attributes in \phi, \mathpzc{A}=\bigcup_{\phi_{i}}Attr(\phi_{i})\setminus X, x\in dom(X), a\in dom(\mathpzc{A}), \delta^{min}_{\phi_{i}[a,x]}=\min_{i,a,x}\{\delta_{\phi_{i}[a,x]}\}, \phi^{min}[a,x]=\min_{i,a,x}\{\phi_{i}[a,x]\} and \eta=\max_{X_{i}}\{\text{in-degree}(X_{i})+\text{out-degree}(X_{i})\}+1.

###### Proof.

Proof Structure:

The proof is structured as follows. First, we compute the error for a single term \boldsymbol{\phi}_{\mathpzc{A}}[a],a\in dom(\mathpzc{A}) (Eq. (21),(22),(23)). Next we compute the total error \delta_{\boldsymbol{\phi}_{\mathpzc{A}}} by summing over all a\in dom(\mathpzc{A}). This is done by dividing the summands into two types of terms: (a) \Upsilon_{\phi_{1}[a,x]} and (b) \delta_{\prod_{i=2}^{t}\phi_{i}[a,x]}\prod_{i=1}^{t}\phi_{i}[a,x] (Eq. (25),(26)). We prove that the summation of the first type of terms (\Upsilon_{\phi_{1}[x]}) can be lower bounded by 0 non-trivially. Then we compute a lower bound on the terms of the second type (Eq. (31)), which gives our final answer (Eq. (32)).

Step 1: Computing error in a single term \boldsymbol{\phi}_{\mathpzc{A}}[a], \delta_{\boldsymbol{\phi}_{\mathpzc{A}}[a]}

The error in \boldsymbol{\phi}_{\mathpzc{A}}[a] due to noise injection is given by

$$\begin{aligned}
\delta_{\boldsymbol{\phi}_{\mathpzc{A}}[a]}&=\Bigg|\sum_{x}\prod_{i=1}^{t}\phi_{i}[a,x]-\sum_{x}\prod_{i=1}^{t}\tilde{\phi}_{i}[a,x]\Bigg|\\
&=\Bigg|\sum_{x}\Big(\phi_{1}[a,x]\prod_{i=2}^{t}\phi_{i}[a,x]-\tilde{\phi}_{1}[a,x]\prod_{i=2}^{t}\tilde{\phi}_{i}[a,x]\Big)\Bigg|\\
&=\Bigg|\sum_{x}\Big(\phi_{1}[a,x]\prod_{i=2}^{t}\phi_{i}[a,x]-\tilde{\phi}_{1}[a,x]\prod_{i=2}^{t}(\phi_{i}[a,x]\pm\delta_{\phi_{i}[a,x]})\Big)\Bigg|
\end{aligned}\tag{21}$$

Using the rule of standard error propagation, we have

$$\delta_{\prod_{i=2}^{t}\phi_{i}[a,x]}=\prod_{i=2}^{t}\tilde{\phi}_{i}[a,x]\sqrt{\sum_{i=2}^{t}\Big(\frac{\delta_{\phi_{i}[a,x]}}{\phi_{i}[a,x]}\Big)^{2}}\tag{22}$$

Thus from the above equation (Eq. (22)) we can rewrite Eq. (21) as follows,

$$\begin{aligned}
\delta_{\boldsymbol{\phi}_{\mathpzc{A}}[a]}&=\Bigg|\sum_{x}\Big(\phi_{1}[a,x]\prod_{i=2}^{t}\phi_{i}[a,x]-\tilde{\phi}_{1}[a,x]\prod_{i=2}^{t}\phi_{i}[a,x]\big(1\pm\delta_{\prod_{i=2}^{t}\phi_{i}[a,x]}\big)\Big)\Bigg|\\
&=\Bigg|\sum_{x}\Big((\phi_{1}[a,x]-\tilde{\phi}_{1}[a,x])\prod_{i=2}^{t}\phi_{i}[a,x]\pm\delta_{\prod_{i=2}^{t}\phi_{i}[a,x]}\prod_{i=1}^{t}\phi_{i}[a,x]\Big)\Bigg|
\end{aligned}\tag{23}$$
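The error-propagation rule of Eq. (22) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the factor entries and their errors are hypothetical, the helper name `product_error` is ours, and the product is taken over whatever values are supplied (passing the noisy entries mirrors Eq. (22) literally, while passing the true entries gives the usual first-order form).

```python
import math


def product_error(values, errors):
    """Propagated error of prod(values).

    Combines the relative errors in quadrature and scales by the product:
    |prod(values)| * sqrt(sum((err_i / val_i)^2)).
    """
    if len(values) != len(errors):
        raise ValueError("values and errors must have the same length")
    if any(v == 0 for v in values):
        raise ValueError("relative error is undefined for a zero entry")
    rel_sq = sum((e / v) ** 2 for v, e in zip(values, errors))
    return abs(math.prod(values)) * math.sqrt(rel_sq)


# Hypothetical entries phi_2[a,x], phi_3[a,x] and their errors.
phi = [0.5, 0.4]
delta = [0.03, 0.04]
print(product_error(phi, delta))
```

The rule is undefined when an entry is exactly zero, which is why the sketch rejects such inputs rather than guessing a convention.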

Step 2: Computing the total error \delta_{\boldsymbol{\phi}_{\mathpzc{A}}}

Now, the total error in \phi_{\mathpzc{A}} is

$$\delta_{\phi_{\mathpzc{A}}}=\sum_{a}\delta_{\boldsymbol{\phi}_{\mathpzc{A}}[a]}\tag{24}$$

Collecting all the product terms from the above equation (24) with \phi_{1}[a,x]-\tilde{\phi}_{1}[a,x] as a multiplicand, we get

$$\Upsilon_{\phi_{1}[a,x]}=(\phi_{1}[a,x]-\tilde{\phi}_{1}[a,x])\sum_{a}\prod_{i=2}^{t}\phi_{i}[a,x]\tag{25}$$

Thus \delta_{\boldsymbol{\phi}_{\mathpzc{A}}} can be rewritten as

$$\delta_{\boldsymbol{\phi}_{\mathpzc{A}}}=\sum_{a,x}\Upsilon_{\phi_{1}[a,x]}\pm\sum_{a,x}\prod_{i=1}^{t}\phi_{i}[a,x]\,\delta_{\prod_{i=2}^{t}\phi_{i}[a,x]}\tag{26}$$

First we show that for a specific DGM we have \sum_{a,x}\Upsilon_{\phi_{1}[a,x]}=0 as follows. Let us assume that the DGM has Attr(\phi_{1})=X. Thus \phi_{1}[a,x] reduces to just \phi_{1}[x].

$$\begin{aligned}
\Upsilon_{\phi_{1}[x]}&=(\phi_{1}[x]-\tilde{\phi}_{1}[x])\sum_{a}\prod_{i=2}^{t}\phi_{i}[a,x]\\
&=(\phi_{1}[x]-\tilde{\phi}_{1}[x])\Big(\sum_{a_{k}}\cdots\sum_{a_{1}}\prod_{i=2}^{t}\phi_{i}[a_{1},\cdots,a_{k},x]\Big)\\
&=(\phi_{1}[x]-\tilde{\phi}_{1}[x])\Big(\sum_{a_{k}}\cdots\sum_{a_{2}}\prod_{i=3}^{t}\phi_{i}[a_{2},\cdots,a_{k},x]\sum_{a_{1}}\phi_{2}[a_{1},\cdots,a_{k},x]\Big)\\
&\qquad\Big[\text{Assuming that }\phi_{2}\text{ is the only factor with attribute }\mathpzc{A}_{1}\Big]
\end{aligned}$$

Now each factor \phi_{i} is either a true parameter (CPD) of the DGM \mathcal{N} or a CPD over some other DGM (Lemma A.1). Thus, let us assume that \phi_{2} represents a conditional of the form P[\mathpzc{A}_{1}|\mathbf{A},X],\mathbf{A}=\mathpzc{A}/\mathpzc{A}_{1}. Thus we have \sum_{a_{1}}\phi_{2}[a_{1},\cdots,a_{k},x]=\sum_{a_{1}}P[\mathpzc{A}_{1}=a_{1}|\mathpzc{A}_{2}=a_{2},\cdots,\mathpzc{A}_{k}=a_{k},X=x]=1. Repeating the above process for the remaining factors \phi_{i},i\in\{3,\cdots,t\}, we get

$$\Upsilon_{\phi_{1}[x]}=\phi_{1}[x]-\tilde{\phi}_{1}[x]\tag{27}$$

For ease of understanding, we illustrate the above result on our example DGM (Figure 2). Let the order of elimination be \prec=\langle A,B,C,D,E,F\rangle and, for simplicity, again assume binary attributes. Let \phi_{C} be the factor obtained after eliminating A and B. The sum-product task for eliminating C is then given by

$$\begin{aligned}
\boldsymbol{\phi}_{D,E}[0,0]&=\phi_{C}[C=0]\cdot\Theta[D=0|C=0]\Theta[E=0|C=0]\\
&\quad+\phi_{C}[C=1]\cdot\Theta[D=0|C=1]\Theta[E=0|C=1]\\
\boldsymbol{\phi}_{D,E}[0,1]&=\phi_{C}[C=0]\cdot\Theta[D=0|C=0]\Theta[E=1|C=0]\\
&\quad+\phi_{C}[C=1]\cdot\Theta[D=0|C=1]\Theta[E=1|C=1]\\
\boldsymbol{\phi}_{D,E}[1,0]&=\phi_{C}[C=0]\cdot\Theta[D=1|C=0]\Theta[E=0|C=0]\\
&\quad+\phi_{C}[C=1]\cdot\Theta[D=1|C=1]\Theta[E=0|C=1]\\
\boldsymbol{\phi}_{D,E}[1,1]&=\phi_{C}[C=0]\cdot\Theta[D=1|C=0]\Theta[E=1|C=0]\\
&\quad+\phi_{C}[C=1]\cdot\Theta[D=1|C=1]\Theta[E=1|C=1]
\end{aligned}$$

Hence considering noisy \tilde{\boldsymbol{\phi}}_{D,E} we have,

$$\begin{aligned}
\Upsilon_{\phi_{C}[0]}&=(\phi_{C}[C=0]-\tilde{\phi}_{C}[C=0])\cdot\big(\Theta[D=0|C=0]\Theta[E=0|C=0]+\Theta[D=0|C=0]\Theta[E=1|C=0]\\
&\qquad+\Theta[D=1|C=0]\Theta[E=0|C=0]+\Theta[D=1|C=0]\Theta[E=1|C=0]\big)\\
&=(\phi_{C}[C=0]-\tilde{\phi}_{C}[C=0])\cdot\Big(\Theta[D=0|C=0]\big(\Theta[E=0|C=0]+\Theta[E=1|C=0]\big)\\
&\qquad+\Theta[D=1|C=0]\big(\Theta[E=0|C=0]+\Theta[E=1|C=0]\big)\Big)\\
&=(\phi_{C}[C=0]-\tilde{\phi}_{C}[C=0])\cdot\big(\Theta[D=0|C=0]+\Theta[D=1|C=0]\big)\\
&\qquad\Big[\because\Theta[E=0|C=0]+\Theta[E=1|C=0]=1\Big]\\
&=\phi_{C}[C=0]-\tilde{\phi}_{C}[C=0]\qquad\Big[\because\Theta[D=0|C=0]+\Theta[D=1|C=0]=1\Big]
\end{aligned}\tag{28}$$

Similarly

$$\Upsilon_{\phi_{C}[1]}=\phi_{C}[C=1]-\tilde{\phi}_{C}[C=1]\tag{29}$$

Now using Eq. (27) and summing over \forall x\in dom(X)

$$\begin{aligned}
\sum_{x}\Upsilon_{\phi_{1}[x]}&=\sum_{x}(\phi_{1}[x]-\tilde{\phi}_{1}[x])\\
&=0\qquad\Big[\because\sum_{x}\phi_{1}[x]=\sum_{x}\tilde{\phi}_{1}[x]=1\Big]
\end{aligned}\tag{30}$$

Referring back to our example above, since \phi_{C} and \tilde{\phi}_{C} each sum to 1 over dom(C), we trivially have

$$\begin{aligned}
&\phi_{C}[C=0]+\phi_{C}[C=1]=\tilde{\phi}_{C}[C=0]+\tilde{\phi}_{C}[C=1]\\
&\Rightarrow(\phi_{C}[C=0]-\tilde{\phi}_{C}[C=0])+(\phi_{C}[C=1]-\tilde{\phi}_{C}[C=1])=0
\end{aligned}$$
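The cancellation in Eqs. (28)-(30) can be checked numerically. The following sketch uses hypothetical CPD values for \Theta[D|C] and \Theta[E|C], and a hypothetical noisy factor that is assumed to remain normalized; none of these numbers or names come from the paper.

```python
# Hypothetical CPDs for binary D and E given C (each row sums to 1).
theta_D_given_C = {0: [0.2, 0.8], 1: [0.7, 0.3]}  # theta_D_given_C[c][d]
theta_E_given_C = {0: [0.6, 0.4], 1: [0.1, 0.9]}  # theta_E_given_C[c][e]

phi_C = [0.35, 0.65]        # true factor over C
phi_C_noisy = [0.41, 0.59]  # hypothetical noisy factor, assumed to still sum to 1


def upsilon(c):
    """Upsilon_{phi_C[c]} = (phi_C[c] - noisy phi_C[c]) * sum_{d,e} Theta[d|c] * Theta[e|c]."""
    weight = sum(
        theta_D_given_C[c][d] * theta_E_given_C[c][e]
        for d in (0, 1)
        for e in (0, 1)
    )
    return (phi_C[c] - phi_C_noisy[c]) * weight


total = upsilon(0) + upsilon(1)
print(upsilon(0), upsilon(1), total)
```

Because every conditional row sums to 1, the weight multiplying each difference is 1 up to floating-point rounding, and the two differences cancel exactly when both factors are normalized.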

Thus, from Eq. (26)

$$\begin{aligned}
\delta_{\phi_{\mathpzc{A}}}&=\sum_{x}\Upsilon_{\phi_{1}[x]}\pm\sum_{a,x}\delta_{\prod_{i=2}^{t}\phi_{i}[a,x]}\prod_{i=1}^{t}\phi_{i}[a,x]\\
&=\sum_{a,x}\delta_{\prod_{i=2}^{t}\phi_{i}[a,x]}\prod_{i=1}^{t}\phi_{i}[a,x]\qquad\Big[\text{From Eq. (30) and dropping }\pm\text{ as we are dealing with errors}\Big]\\
&\geq\delta^{min}_{\prod_{i=2}^{t}\phi_{i}[a,x]}\sum_{a,x}\prod_{i=1}^{t}\phi_{i}[a,x]\qquad\Big[\delta^{min}_{\prod_{i=2}^{t}\phi_{i}[a,x]}=\min_{a,x}\Big\{\delta_{\prod_{i=2}^{t}\phi_{i}[a,x]}\Big\}\Big]\\
&\geq\delta^{min}_{\prod_{i=2}^{t}\phi_{i}[a,x]}\qquad\Big[\because\text{By Lemma A.1, }\phi_{\mathpzc{A}}\text{ is a CPD, thus }\sum_{a,x}\prod_{i=1}^{t}\phi_{i}[a,x]\geq 1\Big]\\
&=\min_{a,x}\Bigg\{\prod_{i=2}^{t}\phi_{i}[a,x]\sqrt{\sum_{i=2}^{t}\Big(\frac{\delta_{\phi_{i}[a,x]}}{\phi_{i}[a,x]}\Big)^{2}}\Bigg\}\\
&\geq\min_{a,x}\Bigg\{\prod_{i=2}^{t}\phi_{i}[a,x]\sqrt{(t-1)\Big(\frac{\delta^{min}_{\phi_{i}[a,x]}}{\phi_{i}[a,x]}\Big)^{2}}\Bigg\}\qquad\Big[\delta^{min}_{\phi_{i}[a,x]}=\min_{i,a,x}\Big\{\delta_{\phi_{i}[a,x]}\Big\}\Big]\\
&\geq\min_{a,x}\Bigg\{\sqrt{t-1}\,\frac{\delta^{min}_{\phi_{i}[a,x]}}{\phi^{max}_{i}}\prod_{i=2}^{t}\phi_{i}[a,x]\Bigg\}\qquad\Big[\phi^{max}_{i}=\max_{i}\{\phi_{i}[a,x]\}\Big]\\
&\geq\delta^{min}_{\phi_{i}[a,x]}\sqrt{t-1}\,(\phi^{min}_{i}[a,x])^{t-2}\qquad\Big[\text{Assuming }\phi^{min}_{i}[a,x]=\min_{i,a,x}\{\phi_{i}[a,x]\}\Big]
\end{aligned}\tag{31}$$

Now, recall from the variable elimination algorithm that during each elimination step, if Z is the variable being eliminated, then the product term contains all the factors that include Z. For a DGM with graph \mathcal{G}, the maximum number of such factors is clearly \eta=\max_{X_{i}}\{\text{out-degree}(X_{i})+\text{in-degree}(X_{i})\}+1, with degrees taken in \mathcal{G}, i.e., t\leq\eta. Additionally we have \phi^{min}[a,x]\leq\frac{1}{d_{min}}\leq\frac{1}{2} where d_{min} is the minimum size of dom(Attr(\phi)) and clearly d_{min}\geq 2. Since 2^{t}\geq\sqrt{t},t\geq 2, under the constraint that t is an integer and \phi^{min}[a,x]\leq\frac{1}{2}, we have

$$\delta_{\phi_{\mathpzc{A}}}\geq\sqrt{\eta-1}\,\delta^{min}_{\phi_{i}[a,x]}(\phi^{min}[a,x])^{\eta-2}\tag{32}$$

∎
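To make the quantities in Eq. (32) concrete, the sketch below computes \eta from a directed edge list and evaluates the right-hand side of the bound. The edge list is a hypothetical DAG chosen for illustration and need not match Figure 2; the function names and the numeric inputs are likewise assumptions.

```python
import math


def max_degree_plus_one(edges):
    """eta = max over nodes of (in-degree + out-degree) + 1 for a directed edge list."""
    degree = {}
    for parent, child in edges:
        degree[parent] = degree.get(parent, 0) + 1  # out-degree contribution
        degree[child] = degree.get(child, 0) + 1    # in-degree contribution
    return max(degree.values()) + 1


def lower_bound(eta, delta_min, phi_min):
    """Right-hand side of Eq. (32): sqrt(eta - 1) * delta_min * phi_min^(eta - 2)."""
    return math.sqrt(eta - 1) * delta_min * phi_min ** (eta - 2)


# Hypothetical DAG; the edges are an assumption and need not match Figure 2.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "F"), ("E", "F")]
eta = max_degree_plus_one(edges)
print(eta, lower_bound(eta, delta_min=0.01, phi_min=0.25))
```

Since \phi^{min}[a,x]\leq\frac{1}{2}, the factor (\phi^{min}[a,x])^{\eta-2} shrinks geometrically as \eta grows, so the bound is most informative for graphs with small maximum degree.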

#### A.3.2 Upper Bound

###### Theorem A.6.

For a DGM \mathcal{N}, for any sum-product term of the form \boldsymbol{\phi}_{\mathpzc{A}}=\sum_{x}\prod_{i=1}^{t}\phi_{i},t\in\{2,\cdots,n\} in the VE algorithm with the optimal elimination order, we have

$$\delta_{\phi_{\mathpzc{A}}}\leq 2\cdot\eta\cdot d^{\kappa}\delta^{max}_{\phi_{i}[a,x]}$$

where X is the attribute being eliminated, \kappa is the treewidth of \mathcal{G}, d is the maximum domain size of an attribute, Attr(\phi) denotes the set of attributes in \phi, \mathpzc{A}=\bigcup_{i}^{t}\{Attr(\phi_{i})\}/X, a\in dom(\mathpzc{A}), x\in dom(X), \delta^{max}_{\phi_{i}[a,x]}=\max_{i,a,x}\{\delta_{\phi_{i}[a,x]}\} and \eta=\max_{X_{i}}\{\text{in-degree}(X_{i})+\text{out-degree}(X_{i})\}+1.

###### Proof.

Proof Structure:

The proof is structured as follows. First we compute an upper bound for a product of t>0 noisy factors \tilde{\phi}_{i}[a,x],i\in[t] (Lemma A.7). Next we use this lemma to bound the error, \delta_{\phi_{\mathpzc{A}}[a]}, for the factor \phi_{\mathpzc{A}}[a],a\in dom(\mathpzc{A}) (Eq. (33)). Finally we use this result to bound the total error, \delta_{\phi_{\mathpzc{A}}}, by summing over all a\in dom(\mathpzc{A}) (Eq. (34)).

Step 1: Computing the upper bound of the error of a single term \tilde{\boldsymbol{\phi}}_{\mathpzc{A}}[a], \delta_{\boldsymbol{\phi}_{\mathpzc{A}}[a]}

###### Lemma A.7.

For a\in dom(\mathpzc{A}),x\in dom(X)

$$\prod_{i=1}^{t}\tilde{\phi}_{i}[a,x]\leq\prod_{i=1}^{t}\phi_{i}[a,x]+\sum_{i}\delta_{\phi_{i}[a,x]}$$

###### Proof.

First we consider the base case when t=2.

Base Case:

$$\begin{aligned}
\tilde{\phi}_{1}[a,x]\tilde{\phi}_{2}[a,x]&=(\phi_{1}[a,x]\pm\delta_{\phi_{1}[a,x]})(\phi_{2}[a,x]\pm\delta_{\phi_{2}[a,x]})\\
&\leq(\phi_{1}[a,x]+\delta_{\phi_{1}[a,x]})(\phi_{2}[a,x]+\delta_{\phi_{2}[a,x]})\\
&=\phi_{1}[a,x]\cdot\phi_{2}[a,x]+\delta_{\phi_{1}[a,x]}(\phi_{2}[a,x]+\delta_{\phi_{2}[a,x]})+\delta_{\phi_{2}[a,x]}\cdot\phi_{1}[a,x]\\
&\leq\phi_{1}[a,x]\cdot\phi_{2}[a,x]+\delta_{\phi_{1}[a,x]}+\delta_{\phi_{2}[a,x]}\phi_{1}[a,x]\\
&\qquad\Big[\because(\phi_{2}[a,x]+\delta_{\phi_{2}[a,x]})\leq 1\text{ as }\tilde{\phi}_{i}[a,x]\text{ is still a valid probability distribution}\Big]\\
&\leq\phi_{1}[a,x]\cdot\phi_{2}[a,x]+\delta_{\phi_{1}[a,x]}+\delta_{\phi_{2}[a,x]}\qquad\Big[\because\phi_{1}[a,x]<1\Big]
\end{aligned}$$

Inductive Case:

Let us assume that the lemma holds for t=k. Thus we have

$$\begin{aligned}
\prod_{i=1}^{k+1}\tilde{\phi}_{i}[a,x]&=\prod_{i=1}^{k}\tilde{\phi}_{i}[a,x]\cdot\tilde{\phi}_{k+1}[a,x]\\
&\leq\Big(\prod_{i=1}^{k}\phi_{i}[a,x]+\sum_{i=1}^{k}\delta_{\phi_{i}[a,x]}\Big)\cdot(\phi_{k+1}[a,x]+\delta_{\phi_{k+1}[a,x]})\\
&\leq\prod_{i=1}^{k+1}\phi_{i}[a,x]+\Big(\sum_{i=1}^{k}\delta_{\phi_{i}[a,x]}\Big)\cdot(\phi_{k+1}[a,x]+\delta_{\phi_{k+1}[a,x]})+\delta_{\phi_{k+1}[a,x]}\prod_{i=1}^{k}\phi_{i}[a,x]\\
&\leq\prod_{i=1}^{k+1}\phi_{i}[a,x]+\sum_{i=1}^{k}\delta_{\phi_{i}[a,x]}+\delta_{\phi_{k+1}[a,x]}\prod_{i=1}^{k}\phi_{i}[a,x]\\
&\qquad\Big[\because(\phi_{k+1}[a,x]+\delta_{\phi_{k+1}[a,x]})\leq 1\text{ as }\tilde{\phi}_{k+1}[a,x]\text{ is still a valid probability distribution}\Big]\\
&\leq\prod_{i=1}^{k+1}\phi_{i}[a,x]+\sum_{i=1}^{k+1}\delta_{\phi_{i}[a,x]}\qquad\Big[\because\forall i,\phi_{i}[a,x]\leq 1\Big]
\end{aligned}$$

Hence, we have

$$\prod_{i=1}^{t}\tilde{\phi}_{i}[a,x]\leq\prod_{i=1}^{t}\phi_{i}[a,x]+\sum_{i}\delta_{\phi_{i}[a,x]}$$

∎
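As a sanity check on Lemma A.7 (not a substitute for the proof), the sketch below tests the inequality on randomly drawn entries. It assumes that both the true and the noisy entries lie in [0, 1] and takes \delta_{\phi_{i}[a,x]} to be the absolute difference between them; the uniform sampling and the function name `check_lemma` are our own choices.

```python
import math
import random


def check_lemma(t, trials=10_000, seed=0):
    """Empirically test prod(noisy) <= prod(true) + sum(|noisy - true|) for entries in [0, 1]."""
    rng = random.Random(seed)
    for _ in range(trials):
        true_vals = [rng.random() for _ in range(t)]
        noisy_vals = [rng.random() for _ in range(t)]  # noisy entries stay in [0, 1]
        deltas = [abs(n - p) for n, p in zip(noisy_vals, true_vals)]
        lhs = math.prod(noisy_vals)
        rhs = math.prod(true_vals) + sum(deltas)
        if lhs > rhs + 1e-12:  # small tolerance for floating-point rounding
            return False
    return True


print(all(check_lemma(t) for t in range(2, 7)))
```

The assumption that noisy entries remain in [0, 1] mirrors the step in the proof that treats each \tilde{\phi}_{i}[a,x] as a valid probability.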

Next, we compute the error for the factor \boldsymbol{\phi}_{\mathpzc{A}}[a],a\in dom(\mathpzc{A}), as follows

$$\begin{aligned}
\delta_{\boldsymbol{\phi}_{\mathpzc{A}}[a]}&=\Big|\sum_{x}\prod_{i=1}^{t}\phi_{i}[a,x]-\sum_{x}\prod_{i=1}^{t}\tilde{\phi}_{i}[a,x]\Big|\\
&=\Big|\sum_{x}\Big(\prod_{i=1}^{t}\phi_{i}[a,x]-\tilde{\phi}_{1}[a,x]\prod_{i=2}^{t}\tilde{\phi}_{i}[a,x]\Big)\Big|\\
&=\Big|\sum_{x}\Big(\prod_{i=1}^{t}\phi_{i}[a,x]-\tilde{\phi}_{1}[a,x]\prod_{i=2}^{t}(\phi_{i}[a,x]\pm\delta_{\phi_{i}[a,x]})\Big)\Big|\\
&\leq\Big|\sum_{x}\Big(\prod_{i=1}^{t}\phi_{i}[a,x]-\tilde{\phi}_{1}[a,x]\Big(\prod_{i=2}^{t}\phi_{i}[a,x]+\sum_{i=2}^{t}\delta_{\phi_{i}[a,x]}\Big)\Big)\Big|\qquad\Big[\text{Using Lemma A.7}\Big]\\
&\leq\Big|\sum_{x}\Big((\phi_{1}[a,x]-\tilde{\phi}_{1}[a,x])\prod_{i=2}^{t}\phi_{i}[a,x]+\tilde{\phi}_{1}[a,x]\sum_{i=2}^{t}\delta_{\phi_{i}[a,x]}\Big)\Big|\\
&\leq\Big|\sum_{x}\Big((\phi_{1}[a,x]-\tilde{\phi}_{1}[a,x])\prod_{i=2}^{t}\phi_{i}[a,x]+\eta\tilde{\phi}_{1}[a,x]\delta^{*}_{\phi_{i}[a,x]}\Big)\Big|\\
&\qquad\Big[\because t\leq\eta\text{ and assuming }\delta^{*}_{\phi_{i}[a,x]}=\max_{i,x}\{\delta_{\phi_{i}[a,x]}\}\Big]\\
&=\Big|\sum_{x}(\phi_{1}[a,x]-\tilde{\phi}_{1}[a,x])\prod_{i=2}^{t}\phi_{i}[a,x]+\eta\delta^{*}_{\phi_{i}[a,x]}\sum_{x}\tilde{\phi}_{1}[a,x]\Big|\\
&\leq\sum_{x}\Big|\phi_{1}[a,x]-\tilde{\phi}_{1}[a,x]\Big|+\eta\delta^{*}_{\phi_{i}[a,x]}\sum_{x}\tilde{\phi}_{1}[a,x]\qquad\Big[\because\phi_{i}[a,x]\leq 1\Big]
\end{aligned}\tag{33}$$

Step 2: Computing the upper bound of the total error \delta_{\boldsymbol{\phi}_{\mathpzc{A}}}

Now summing over all a\in dom(\mathpzc{A}),

$$\begin{aligned}
\delta_{\boldsymbol{\phi}_{\mathpzc{A}}}&=\sum_{a}\delta_{\boldsymbol{\phi}_{\mathpzc{A}}[a]}\\
&\leq\sum_{a}\Big(\sum_{x}|\phi_{1}[a,x]-\tilde{\phi}_{1}[a,x]|+\eta\delta^{*}_{\phi_{i}[a,x]}\sum_{x}\tilde{\phi}_{1}[a,x]\Big)\qquad\Big[\text{From Eq. (33)}\Big]\\
&=\delta_{\phi_{1}}+\eta\delta^{max}_{\phi_{i}[a,x]}\sum_{a}\sum_{x}\tilde{\phi}_{1}[a,x]\qquad\Big[\delta^{max}_{\phi_{i}[a,x]}=\max_{a}\{\delta^{*}_{\phi_{i}[a,x]}\}\Big]
\end{aligned}$$

Now by Lemma A.2, the maximum size of \mathpzc{A}\cup X is given by the treewidth of the DGM, \kappa. Thus from the fact that \phi_{1} is a CPD (Lemma A.1), we observe that \sum_{a}\sum_{x}\tilde{\phi}_{1}[a,x] is maximized when \phi_{1}[a,x] is of the form P[A^{\prime}|\mathbf{A}],A^{\prime}\in\mathpzc{A}\cup X,|A^{\prime}|=1,\mathbf{A}=(\mathpzc{A}\cup X)/A^{\prime} and is upper bounded by d^{\kappa}, where d is the maximum domain size of an attribute.

$$\begin{aligned}
\delta_{\boldsymbol{\phi}_{\mathpzc{A}}}&\leq\delta_{\phi_{1}}+\eta d^{\kappa}\delta^{max}_{\phi_{i}[a,x]}\qquad\Big[\text{By Lemma A.2 and that }\phi_{1}\text{ is a CPD from Lemma A.1}\Big]\\
&\leq 2\cdot\eta\cdot d^{\kappa}\delta^{max}_{\phi_{i}[a,x]}\qquad\Big[\because\delta_{\phi_{1}}\leq\eta\cdot d^{\kappa}\delta^{max}_{\phi_{i}[a,x]}\Big]
\end{aligned}\tag{34}$$

where \kappa is the treewidth of \mathcal{G} and d is the maximum domain size of an attribute.

∎
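For a quick feel of how the bound in Eq. (34) scales, the following sketch evaluates its right-hand side. The function name `upper_bound` and the input values are hypothetical; in practice \eta, d and \kappa would be read off the graph \mathcal{G} and \delta^{max}_{\phi_{i}[a,x]} from the noise added to the parameters.

```python
def upper_bound(eta, d, kappa, delta_max):
    """Right-hand side of Eq. (34): 2 * eta * d^kappa * delta_max."""
    if eta < 1 or d < 2 or kappa < 0 or delta_max < 0:
        raise ValueError("expected eta >= 1, d >= 2, kappa >= 0, delta_max >= 0")
    return 2 * eta * d ** kappa * delta_max


# Hypothetical values: max degree + 1 = 5, binary attributes, treewidth 2.
print(upper_bound(eta=5, d=2, kappa=2, delta_max=0.01))

# The bound grows by a factor of d for every unit increase in treewidth.
print([upper_bound(5, 2, k, 0.01) for k in range(1, 5)])
```

The d^{\kappa} factor makes the bound exponential in the treewidth, so it is tightest for DGMs with small treewidth and small attribute domains.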