# On Tractable Computation of Expected Predictions

###### Abstract

Computing expected predictions has many interesting applications in areas such as fairness, handling missing values, and data analysis. Unfortunately, computing expectations of a discriminative model with respect to a probability distribution defined by an arbitrary generative model has been proven to be hard in general. In fact, the task is intractable even for simple models such as logistic regression and a naive Bayes distribution. In this paper, we identify a pair of generative and discriminative models that enables tractable computation of expectations of the latter with respect to the former, as well as moments of any order, in the case of regression. Specifically, we consider expressive probabilistic circuits with certain structural constraints that support tractable probabilistic inference. Moreover, we exploit the tractable computation of high-order moments to derive an algorithm to approximate the expectations for classification scenarios in which exact computations are intractable. We evaluate the effectiveness of our exact and approximate algorithms in handling missing data at prediction time, where they prove competitive with standard imputation techniques on a variety of datasets. Finally, we illustrate how the expected prediction framework can be used to reason about the behaviour of discriminative models.

## 1 Introduction

Learning predictive models like regressors or classifiers from data has become a routine exercise in machine learning. Nevertheless, making predictions on unseen data is still a highly challenging task for many real-world applications, all the more so when the data are affected by uncertainty, e.g., in the case of noisy or missing observations.

A principled way to deal with this kind of uncertainty would be to probabilistically reason about the expected outcomes of a predictive model on a particular feature distribution. That is, to compute mathematical expectations of the predictive model w.r.t. a generative model representing the feature distribution. This is a common need that arises in many scenarios including dealing with missing data Little and Rubin (2019); Khosravi et al. (2019), performing feature selection Yu et al. (2009); Choi et al. (2012, 2017), seeking explanations Ribeiro et al. (2016); Lundberg and Lee (2017); Chang et al. (2019) or determining how “fair” the learned predictor is Zafar et al. (2015, 2017).

While dealing with the above expectations is ubiquitous in machine learning, computing them exactly is, however, generally infeasible Roth (1996). As noted in Khosravi et al. (2019), computing the expected predictions of an arbitrary discriminative model w.r.t. an arbitrary generative model is in general computationally intractable. As one would expect, the more expressive these models become, the harder it is to compute the expectations. More interestingly, even resorting to simpler discriminative models like logistic regression does not help reduce the complexity of the task: computing the first moment of its predictions w.r.t. a naive Bayes model is known to be NP-hard Khosravi et al. (2019).

In this work, we introduce a pair of expressive generative and discriminative models for regression, for which it is possible to efficiently compute not only expectations, but moments of any order. We leverage recent advancements in probabilistic circuit representations. Specifically, we formally establish that generative and discriminative circuits enable these computations in time polynomial in the size of the circuits when they are subject to some structural constraints, which however do not hinder their expressiveness.

Moreover, we demonstrate that for classification even the aforementioned structural constraints cannot guarantee tractable exact computation. However, efficient approximation becomes possible in polynomial time by leveraging our algorithm for the computation of arbitrary moments.

Lastly, we investigate applications of computing expectations. We first consider the challenging scenario of missing values at test time. There, we empirically demonstrate that computing the expectation of a discriminative circuit w.r.t. a generative one is a more robust and accurate option than many imputation baselines, for both regression and classification. In addition, we show how we can leverage this framework for exploratory data analysis, to understand the behaviour of predictive models within different sub-populations.

## 2 Expectations and higher order moments of discriminative models

We use uppercase letters for random variables, e.g., $X$, and lowercase letters for their assignments, e.g., $x$. Analogously, we denote sets of variables in bold uppercase, e.g., $\mathbf{X}$, and their assignments in bold lowercase, e.g., $\mathbf{x}$. The set of all possible values that $\mathbf{X}$ can take is denoted as $\mathcal{X}$.

Let $p(\mathbf{X})$ be a probability distribution over $\mathbf{X}$ and $f : \mathcal{X} \to \mathbb{R}$ be a discriminative model, e.g., a regressor, that assigns a real value (outcome) to each complete input configuration $\mathbf{x} \in \mathcal{X}$ (features). The task of computing the $k$-th moment of $f$ with respect to the distribution $p$ is defined as:

$$M_k(f, p) \triangleq \mathbb{E}_{\mathbf{x} \sim p(\mathbf{X})}\big[f(\mathbf{x})^k\big] = \sum_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x})^k\, p(\mathbf{x}) \qquad (1)$$

Computing moments of arbitrary degree allows one to probabilistically reason about the outcomes of $f$. That is, it provides a description of the distribution of its predictions assuming $p$ as the data-generating distribution. For instance, we can compute the mean of $f$ w.r.t. $p$, i.e., $M_1(f, p)$, or reason about the dispersion (variance) of its outcomes, i.e., $M_2(f, p) - M_1^2(f, p)$.

These computations can be a very useful tool to reason in a principled way about the behaviour of $f$ in the presence of uncertainty, such as making predictions with missing feature values Khosravi et al. (2019) or deciding a subset of $\mathbf{X}$ to observe Krause and Guestrin (2009); Yu et al. (2009). For example, given a partial assignment $\mathbf{x}^o$ to a subset $\mathbf{X}^o \subseteq \mathbf{X}$, the expected prediction of $f$ over the unobserved variables $\mathbf{X}^m = \mathbf{X} \setminus \mathbf{X}^o$ can be computed as $\mathbb{E}_{\mathbf{x}^m \sim p(\mathbf{X}^m \mid \mathbf{x}^o)}\big[f(\mathbf{x}^m, \mathbf{x}^o)\big]$, which is equivalent to $M_1\big(f, p(\cdot \mid \mathbf{x}^o)\big)$.

Unfortunately, computing arbitrary moments, and even just the expectation, of a discriminative model w.r.t. an arbitrary distribution is, in general, computationally hard. Under the restrictive assumptions that $p$ fully factorizes, i.e., $p(\mathbf{X}) = \prod_i p(X_i)$, and that $f$ is a simple linear model of the form $f(\mathbf{x}) = \sum_i \phi_i x_i$, computing expectations can be done in linear time. However, the task suddenly becomes NP-hard even for slightly more expressive models, for instance when $p$ is a naive Bayes distribution and $f$ is a logistic regression (a generalized linear model with a sigmoid activation function). See Khosravi et al. (2019) for a discussion.
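As a concrete toy illustration of Eq. 1 and of the tractable fully factorized case, the sketch below brute-forces the moments of a linear model over three binary features and checks them against the linear-time expectation formula. All names and numbers here are our own, not from the paper:

```python
import itertools
import math

# Toy setup: 3 binary features with a fully factorized distribution
# p(X) = prod_i p(X_i) and a linear model f(x) = sum_i phi_i * x_i.
probs = [0.2, 0.7, 0.5]      # p(X_i = 1) for each feature
phi = [1.0, -2.0, 0.5]       # linear weights

def p(x):
    # probability of a complete configuration under the factorized model
    return math.prod(probs[i] if xi else 1 - probs[i] for i, xi in enumerate(x))

def f(x):
    return sum(w * xi for w, xi in zip(phi, x))

def moment(k):
    # Eq. (1): M_k(f, p) = sum_x f(x)^k p(x), by brute-force enumeration
    return sum(f(x) ** k * p(x) for x in itertools.product([0, 1], repeat=3))

# Linear-time expectation for the factorized case: E[f] = sum_i phi_i * p(X_i = 1)
closed_form = sum(w * q for w, q in zip(phi, probs))
assert abs(moment(1) - closed_form) < 1e-12

# the second moment gives the variance of the predictions
variance = moment(2) - moment(1) ** 2
```

Enumeration costs $O(2^{|\mathbf{X}|})$ in general; the point of the factorized special case (and, later, of the circuit pair) is to avoid exactly this blow-up.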

In Section 4, we propose a pair of generative and discriminative models that are highly expressive and yet still allow for polytime computation of exact moments and expectations of the latter w.r.t. the former. We review the necessary background material in the next section.

## 3 Generative and discriminative circuits

This section introduces the pair of circuit representations we choose as expressive generative and discriminative models. In both cases, we assume the input is discrete. We later establish under which conditions computation of expected predictions becomes tractable.

#### Logical circuits

A logical circuit Darwiche and Marquis (2002); Darwiche (2003) is a directed acyclic graph representing a logical sentence where each node $n$ encodes a logical sub-formula, denoted as $[n]$. Each inner node in the graph is either an AND or an OR gate, and each leaf (input) node encodes a Boolean literal (e.g., $X$ or $\neg X$). We denote the set of child nodes of a gate $n$ as $\mathsf{ch}(n)$. A node $n$ is said to be satisfied by the assignment $\mathbf{x}$, written $\mathbf{x} \models n$, if $\mathbf{x}$ conforms to the logical formula encoded by $n$. Fig. 1 depicts some examples of logical circuits. Several syntactic properties of circuits enable efficient logical and probabilistic reasoning over them Darwiche and Marquis (2002); Shen et al. (2016). We now review them, as they will be pivotal for our efficient computations of expectations and high-order moments in Section 4.

#### Syntactic Properties

A circuit is said to be decomposable if the inputs of every AND gate depend on disjoint sets of variables. For notational simplicity, we assume decomposable AND gates to have two inputs, denoted the $\mathsf{L}$ (left) and $\mathsf{R}$ (right) children, depending on variables $\mathbf{X}^{\mathsf{L}}$ and $\mathbf{X}^{\mathsf{R}}$ respectively. In addition, a circuit satisfies structured decomposability if each of its AND gates decomposes according to a vtree, a binary tree structure whose leaves are the circuit variables. That is, the $\mathsf{L}$ (resp. $\mathsf{R}$) child of an AND gate depends on variables that appear in the left (resp. right) branch of its corresponding vtree node. Fig. 1 shows a vtree and visually maps its nodes to the AND gates of two example circuits. A circuit is smooth if, for every OR gate, all its children depend on the same set of variables. Lastly, a circuit is deterministic if, for any input, at most one child of every OR node has a non-zero output. For example, Fig. 1(c) highlights in red, from the root to the leaves, the child nodes that have non-zero outputs for a given input; note that every OR gate in Fig. 1(c) has at most one hot input wire.

#### Generative circuits

As we will see in Section 4, the tractable computation of moments requires the generative circuit modeling $p(\mathbf{X})$ to satisfy structured decomposability and smoothness. A well-known example of such a circuit is the probabilistic sentential decision diagram (PSDD) Kisa et al. (2014).^1

^1 PSDDs by definition also satisfy determinism, but we do not require this property for computing moments.

A PSDD is characterized by its logical circuit structure and by parameters $\theta$ assigned to the inputs of each OR gate. Intuitively, each PSDD node $n$ recursively defines a distribution $p_n$ over the subset of the variables appearing in the sub-circuit rooted at it. More precisely:

$$p_n(\mathbf{x}) = \begin{cases} [\mathbf{x} \models n] & \text{if } n \text{ is a leaf node} \\ p_{n^{\mathsf{L}}}(\mathbf{x}^{\mathsf{L}})\; p_{n^{\mathsf{R}}}(\mathbf{x}^{\mathsf{R}}) & \text{if } n \text{ is an AND gate} \\ \sum_{i \in \mathsf{ch}(n)} \theta_i\, p_{n_i}(\mathbf{x}) & \text{if } n \text{ is an OR gate} \end{cases} \qquad (2)$$

Here, $[\mathbf{x} \models n]$ indicates whether the leaf $n$ is satisfied by the input $\mathbf{x}$, and $\mathbf{x}^{\mathsf{L}}$ and $\mathbf{x}^{\mathsf{R}}$ denote the restrictions of the configuration $\mathbf{x}$ to the decomposition defined by an AND gate over its $\mathsf{L}$ (resp. $\mathsf{R}$) child. As such, an AND gate of a PSDD represents a factorization over independent sets of variables, whereas an OR gate defines a deterministic mixture model.

PSDDs allow for the exact computation of the probability of complete and partial configurations in time linear in the size of the circuit. They have been successfully employed as state-of-the-art density estimators not only for unstructured Liang et al. (2017) but also for structured feature spaces Choi et al. (2015a); Shen et al. (2017, 2018).
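The semantics of Eq. 2 can be sketched in a few lines of code. The tuple-based node encoding below is our own illustrative choice, not a PSDD file format; setting a variable's indicators to 1 (here, passing `None`) marginalizes it out, which is how partial configurations are evaluated in linear time:

```python
# A minimal PSDD-style evaluator following Eq. (2).
# Node encoding (ours): ('lit', var, val), ('and', left, right),
# ('or', [(theta, child), ...]).
def pr(node, x):
    kind = node[0]
    if kind == 'lit':
        _, i, v = node
        # x[i] is None when X_i is marginalized out: the indicator sums to 1
        return 1.0 if x[i] is None else float(x[i] == v)
    if kind == 'and':          # factorization over disjoint variable sets
        return pr(node[1], x) * pr(node[2], x)
    # OR gate: mixture with parameters theta
    return sum(theta * pr(child, x) for theta, child in node[1])

# p(X0, X1) where X1's distribution depends on X0 (not fully factorized)
x1_given_x0  = ('or', [(0.9, ('lit', 1, 1)), (0.1, ('lit', 1, 0))])
x1_given_nx0 = ('or', [(0.3, ('lit', 1, 1)), (0.7, ('lit', 1, 0))])
psdd = ('or', [(0.6, ('and', ('lit', 0, 1), x1_given_x0)),
               (0.4, ('and', ('lit', 0, 0), x1_given_nx0))])

total = sum(pr(psdd, (a, b)) for a in (0, 1) for b in (0, 1))   # sums to 1
marg_x0 = pr(psdd, (1, None))    # p(X0 = 1) by marginalizing X1 in one pass
```

One bottom-up pass answers both complete and partial configurations, matching the linear-time claim above.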

#### Discriminative circuits

For the discriminative model $f$, we adopt and extend the semantics of logistic circuits (LCs), discriminative circuits recently introduced for classification Liang and Van den Broeck (2019). An LC is defined by a structured decomposable, smooth and deterministic logical circuit with parameters $\phi$ on the inputs of its OR gates. It acts as a classifier on top of a rich set of non-linear features, extracted by its logical circuit structure. Specifically, an LC associates each input $\mathbf{x}$ with an embedding representation whose binary features correspond to the logical formulas of the AND gates in the circuit, each feature indicating whether $\mathbf{x}$ satisfies the corresponding gate.

Classification is performed on this new feature representation by applying a sigmoid non-linearity to a linear combination of the features and, as for simple logistic regression, the parameters are amenable to convex optimization. Alternatively, one can fully characterize an LC by recursively defining the output $g_n$ of each node $n$ in it. Given an input $\mathbf{x}$, each node $n$ in an LC computes:

$$g_n(\mathbf{x}) = \begin{cases} 0 & \text{if } n \text{ is a leaf node} \\ g_{n^{\mathsf{L}}}(\mathbf{x}) + g_{n^{\mathsf{R}}}(\mathbf{x}) & \text{if } n \text{ is an AND gate} \\ \sum_{i \in \mathsf{ch}(n)} [\mathbf{x} \models n_i]\,\big(\phi_i + g_{n_i}(\mathbf{x})\big) & \text{if } n \text{ is an OR gate} \end{cases} \qquad (3)$$

Again, $[\mathbf{x} \models n_i]$ is an indicator of whether $\mathbf{x}$ satisfies $n_i$, effectively enforcing determinism in LCs. Classification is then done by applying a sigmoid function to the output of the circuit root $n$: $f(\mathbf{x}) = \sigma(g_n(\mathbf{x}))$. The increased expressive power of LCs w.r.t. simple linear regressors lies in the rich representations they learn, which in turn rely on the underlying circuit structure as a powerful feature extractor Vergari et al. (2018, 2016).
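A minimal sketch of the recursive evaluation in Eq. 3, again with a toy node encoding of our own; note how determinism lets at most one child of each OR gate contribute to the sum:

```python
# Regression-circuit evaluation per Eq. (3); encoding (ours):
# ('lit', var, val), ('and', left, right), ('or', [(phi, child), ...]).
def sat(node, x):
    kind = node[0]
    if kind == 'lit':
        return x[node[1]] == node[2]
    if kind == 'and':
        return sat(node[1], x) and sat(node[2], x)
    return any(sat(c, x) for _, c in node[1])

def g(node, x):
    kind = node[0]
    if kind == 'lit':
        return 0.0                          # leaves output 0
    if kind == 'and':                       # AND: sum of children outputs
        return g(node[1], x) + g(node[2], x)
    # OR: determinism guarantees at most one satisfied child contributes
    return sum(phi + g(c, x) for phi, c in node[1] if sat(c, x))

# A tiny deterministic circuit over (X0, X1)
or1 = ('or', [(2.0, ('lit', 1, 1)), (-1.0, ('lit', 1, 0))])
or2 = ('or', [(0.5, ('lit', 1, 1)), (0.0, ('lit', 1, 0))])
rc  = ('or', [(1.0, ('and', ('lit', 0, 1), or1)),
              (3.0, ('and', ('lit', 0, 0), or2))])

outputs = {x: g(rc, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

Applying a sigmoid to `g(rc, x)` would turn this regressor into the LC classifier described above.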

While LCs have been introduced for classification and shown to outperform much larger neural networks Liang and Van den Broeck (2019), we leverage them for regression. That is, we are interested in computing the expectations of the output $g_n$ of the root node w.r.t. a generative model $p$. We call an LC with no sigmoid function applied to its output a regression circuit (RC). As we will show in the next section, we are able to exactly compute any moment of an RC $g$ w.r.t. a generative circuit $p$ in time polynomial in the size of the circuits, if $g$ shares the same vtree as $p$.

## 4 Computing expectations and moments for circuit pairs

We now introduce our main result, which leads to efficient algorithms for tractable Expectation and Moment Computation of Circuit pairs (EC and MC), in which the discriminative model is an RC and the generative counterpart a PSDD, sharing the same vtree structure.

###### Theorem 1.

Let $p$ and $g$ be the root nodes of a PSDD and an RC over $\mathbf{X}$ sharing the same vtree, and let $|p|$ and $|g|$ be their respective numbers of nodes.
Then the $k$-th moment of $g$ w.r.t. the distribution encoded by $p$, i.e., $M_k(g, p)$, can be computed exactly in time $O(k^2\,|p|\,|g|)$
by the MC algorithm.^2

^2 This is a very loose upper bound, since we only look at pairs of nodes in the two circuits that correspond to the same vtree node. A tighter bound would be $O\big(k^2 \sum_{v} c^{p}_{v}\, c^{g}_{v}\big)$, where $v$ ranges over the non-leaf vtree nodes and $c^{p}_{v}$, $c^{g}_{v}$ are the numbers of nodes in the two circuits that correspond to $v$.

We then investigate how this result can be generalized to arbitrary circuit pairs and how restrictive the above structural requirement is. In fact, we demonstrate that computing expectations and moments for circuit pairs not sharing a vtree is #P-hard. Furthermore, we address the hardness of computing expectations for an LC w.r.t. a PSDD, due to the introduction of the sigmoid function over $g$, by approximating them through the tractable computation of moments.

### 4.1 EC: Expectations of regression circuits

Intuitively, the computation of expectations becomes tractable because we can “break it down” to the leaves of both the PSDD and RC circuits, where it reduces to simple computations. Indeed, two circuits sharing the same vtree not only ensure that a pair of nodes at the same level will be of the same type, i.e., either both AND or OR gates, but they will also depend on exactly the same set of variables. This will enable a polynomial time decomposition.

We now show how this computation recursively decomposes over pairs of OR and AND gates, starting from the roots of the PSDD $p$ and the RC $g$. We refer the reader to the Appendix for detailed proofs of all propositions and theorems in this section. Without loss of generality, we assume that the roots of both $p$ and $g$ are OR gates, and that each level of the circuits alternates between AND and OR gates.

###### Proposition 1.

Let $n$ and $m$ be OR gates of a PSDD and an RC, respectively. Then the expectation of the regressor $g_m$ w.r.t. the distribution $p_n$ is:

$$\mathbb{E}_{\mathbf{x} \sim p_n}\big[g_m(\mathbf{x})\big] = \sum_{i \in \mathsf{ch}(n)} \theta_i \sum_{j \in \mathsf{ch}(m)} \Big( \phi_j\, \mathbb{E}_{\mathbf{x} \sim p_{n_i}}\big[[\mathbf{x} \models m_j]\big] + \mathbb{E}_{\mathbf{x} \sim p_{n_i}}\big[[\mathbf{x} \models m_j]\, g_{m_j}(\mathbf{x})\big] \Big)$$

The above proposition illustrates how the expectation of an OR gate $m$ of an RC w.r.t. an OR gate $n$ of a PSDD is a weighted sum of expectations over the pairs of their children. The number of smaller expectations to be computed is therefore quadratic in the number of children. More specifically, one now has to compute the expectations of two different functions w.r.t. the children $n_i$ of the PSDD node $n$.

First, $\mathbb{E}_{\mathbf{x} \sim p_{n_i}}\big[[\mathbf{x} \models m_j]\big]$ is the expectation of the indicator function associated with the $j$-th child of $m$ (see Eq. 3) w.r.t. the $i$-th child of $n$. Intuitively, this is the probability of the logical formula $[m_j]$ being satisfied according to the distribution encoded by $p_{n_i}$. Fortunately, it can be computed efficiently, in time quadratic in the size of the two circuits, as already demonstrated in Choi et al. (2015a).

On the other hand, computing the other expectation term, $\mathbb{E}_{\mathbf{x} \sim p_{n_i}}\big[[\mathbf{x} \models m_j]\, g_{m_j}(\mathbf{x})\big]$, requires a novel algorithm tailored to RCs and PSDDs. We next show how to further decompose this expectation from AND gates to their OR children.

###### Proposition 2.

Let $n$ and $m$ be AND gates of a PSDD and an RC, respectively. Let $n^{\mathsf{L}}$ and $n^{\mathsf{R}}$ (resp. $m^{\mathsf{L}}$ and $m^{\mathsf{R}}$) be the left and right children of $n$ (resp. $m$). Then the expectation of the function $[\mathbf{x} \models m]\, g_m(\mathbf{x})$ w.r.t. the distribution $p_n$ is:

$$\mathbb{E}_{\mathbf{x} \sim p_n}\big[[\mathbf{x} \models m]\, g_m(\mathbf{x})\big] = \mathbb{E}_{\mathbf{x}^{\mathsf{L}} \sim p_{n^{\mathsf{L}}}}\big[[\mathbf{x}^{\mathsf{L}} \models m^{\mathsf{L}}]\, g_{m^{\mathsf{L}}}(\mathbf{x}^{\mathsf{L}})\big]\; \mathbb{E}_{\mathbf{x}^{\mathsf{R}} \sim p_{n^{\mathsf{R}}}}\big[[\mathbf{x}^{\mathsf{R}} \models m^{\mathsf{R}}]\big] \;+\; \mathbb{E}_{\mathbf{x}^{\mathsf{L}} \sim p_{n^{\mathsf{L}}}}\big[[\mathbf{x}^{\mathsf{L}} \models m^{\mathsf{L}}]\big]\; \mathbb{E}_{\mathbf{x}^{\mathsf{R}} \sim p_{n^{\mathsf{R}}}}\big[[\mathbf{x}^{\mathsf{R}} \models m^{\mathsf{R}}]\, g_{m^{\mathsf{R}}}(\mathbf{x}^{\mathsf{R}})\big]$$

We are again left with the task of computing the expectations of the RC node indicator functions, i.e., $\mathbb{E}_{\mathbf{x}^{\mathsf{L}} \sim p_{n^{\mathsf{L}}}}\big[[\mathbf{x}^{\mathsf{L}} \models m^{\mathsf{L}}]\big]$ and $\mathbb{E}_{\mathbf{x}^{\mathsf{R}} \sim p_{n^{\mathsf{R}}}}\big[[\mathbf{x}^{\mathsf{R}} \models m^{\mathsf{R}}]\big]$, which can also be done with the algorithm of Choi et al. (2015a). Furthermore, note that the remaining expectation terms can readily be computed using Proposition 1, since they concern pairs of OR nodes.

We briefly highlight how determinism in the regression circuit plays a crucial role in enabling this computation. In fact, the OR gates being deterministic ensures that the otherwise non-decomposable product of indicator functions $[\mathbf{x} \models m]\,[\mathbf{x} \models m_j]$, where $m$ is a parent OR gate of an AND gate $m_j$, is simply equal to $[\mathbf{x} \models m_j]$. We refer the reader to Appendix A.3 for a detailed discussion.

Recursively, one is guaranteed to reach pairs of leaf nodes in the RC and PSDD, for which the respective expectations can be computed in $O(1)$ by checking whether their associated Boolean indicators agree, and by noting that $g_m(\mathbf{x}) = 0$ if $m$ is a leaf (see Eq. 3). Putting it all together, we obtain the recursive procedure shown in Algorithm 1, where the probability terms $\mathbb{E}_{\mathbf{x} \sim p_n}\big[[\mathbf{x} \models m]\big]$ are computed with the algorithm of Choi et al. (2015a). As the algorithm computes expectations in a bottom-up fashion, the intermediate computations can be cached to avoid evaluating the same pair of nodes more than once, thereby keeping the complexity as stated by Theorem 1.

### 4.2 MC: Moments of regression circuits

Our algorithmic solution goes beyond the tractable computation of the sole expectation of an RC. Indeed, any arbitrary-order moment of $g$ can be computed w.r.t. $p$, still in polynomial time. We call this algorithm MC, and we delineate its main routines in the following propositions.^3

^3 The MC algorithm can easily be derived from EC in Algorithm 1, using the equations in this section.

###### Proposition 3.

Let $n$ and $m$ be OR gates of a PSDD and an RC, respectively. Then the $k$-th moment of the regressor $g_m$ w.r.t. the distribution $p_n$ is:

$$M_k(g_m, p_n) = \sum_{i \in \mathsf{ch}(n)} \theta_i \sum_{j \in \mathsf{ch}(m)} \sum_{l=0}^{k} \binom{k}{l}\, \phi_j^{\,k-l}\; \mathbb{E}_{\mathbf{x} \sim p_{n_i}}\big[[\mathbf{x} \models m_j]\, g^{\,l}_{m_j}(\mathbf{x})\big]$$

###### Proposition 4.

Let $n$ and $m$ be AND gates of a PSDD and an RC, respectively. Let $n^{\mathsf{L}}$ and $n^{\mathsf{R}}$ (resp. $m^{\mathsf{L}}$ and $m^{\mathsf{R}}$) be the left and right children of $n$ (resp. $m$). Then the $k$-th moment of the function $[\mathbf{x} \models m]\, g_m(\mathbf{x})$ w.r.t. the distribution $p_n$ is:

$$\mathbb{E}_{\mathbf{x} \sim p_n}\big[[\mathbf{x} \models m]\, g^{\,k}_m(\mathbf{x})\big] = \sum_{l=0}^{k} \binom{k}{l}\; \mathbb{E}_{\mathbf{x}^{\mathsf{L}} \sim p_{n^{\mathsf{L}}}}\big[[\mathbf{x}^{\mathsf{L}} \models m^{\mathsf{L}}]\, g^{\,l}_{m^{\mathsf{L}}}(\mathbf{x}^{\mathsf{L}})\big]\; \mathbb{E}_{\mathbf{x}^{\mathsf{R}} \sim p_{n^{\mathsf{R}}}}\big[[\mathbf{x}^{\mathsf{R}} \models m^{\mathsf{R}}]\, g^{\,k-l}_{m^{\mathsf{R}}}(\mathbf{x}^{\mathsf{R}})\big]$$

Analogously to computing simple expectations, by recursively and alternately applying Propositions 3 and 4, we arrive at the moments of the leaves of both circuits, while gradually reducing the order of the involved moments.

Furthermore, the lower-order moments in Proposition 4 that decompose over the $\mathsf{L}$ and $\mathsf{R}$ children, e.g., $\mathbb{E}_{\mathbf{x}^{\mathsf{L}} \sim p_{n^{\mathsf{L}}}}\big[[\mathbf{x}^{\mathsf{L}} \models m^{\mathsf{L}}]\, g^{\,l}_{m^{\mathsf{L}}}(\mathbf{x}^{\mathsf{L}})\big]$, can be computed by noting that, for a pair of OR gates $n$ and $m$, they reduce to:

$$\mathbb{E}_{\mathbf{x} \sim p_n}\big[[\mathbf{x} \models m]\, g^{\,l}_m(\mathbf{x})\big] = \sum_{i \in \mathsf{ch}(n)} \theta_i \sum_{j \in \mathsf{ch}(m)} \sum_{t=0}^{l} \binom{l}{t}\, \phi_j^{\,l-t}\; \mathbb{E}_{\mathbf{x} \sim p_{n_i}}\big[[\mathbf{x} \models m_j]\, g^{\,t}_{m_j}(\mathbf{x})\big] \qquad (4)$$

Note again that these computations are made possible by the interplay of determinism and shared vtrees between $p$ and $g$. From the former follows the idempotence of the indicator node functions, i.e., $[\mathbf{x} \models m]^k = [\mathbf{x} \models m]$. The latter ensures that the AND gate children of a pair of OR gates in $p$ and $g$ decompose in the same way, thereby enabling efficient computations.
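The key step behind Proposition 4 is that an AND gate splits $g$ into a sum of two functions over disjoint, hence independent under the PSDD, sets of variables, so its $k$-th power expands binomially into products of lower-order moments of each side. A brute-force numerical check with toy numbers of our own:

```python
import math

# Left and right parts are independent: p(xL, xR) = pL(xL) * pR(xR).
pL = {0: 0.25, 1: 0.75}          # distribution over the left variable
pR = {0: 0.4, 1: 0.6}            # distribution over the right variable
gL = {0: -1.0, 1: 2.0}           # left child's contribution to the regressor
gR = {0: 0.5, 1: 1.5}            # right child's contribution

def mom(p, g, k):
    # k-th moment of g under p, by enumeration
    return sum(p[x] * g[x] ** k for x in p)

k = 3
# direct k-th moment of the AND gate's output gL + gR
direct = sum(pL[a] * pR[b] * (gL[a] + gR[b]) ** k for a in pL for b in pR)
# binomial decomposition into products of lower-order moments (Prop. 4's form)
binom = sum(math.comb(k, l) * mom(pL, gL, l) * mom(pR, gR, k - l)
            for l in range(k + 1))
assert abs(direct - binom) < 1e-12
```

The decomposition turns one order-$k$ moment over the joint space into $k+1$ pairs of cheaper one-sided moments, which is what keeps MC polynomial.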

Given this, a natural question arises: “Would not requiring a PSDD and an RC to have the same vtree structure still allow for the tractable computation of $M_k(g, p)$?” Unfortunately, this is not the case, as we demonstrate in the following theorem.

###### Theorem 2.

Computing any moment of an RC $g$ w.r.t. a PSDD distribution $p$ is #P-hard if $g$ and $p$ do not share the same vtree.

At a high level, we reduce #SAT, a well-known #P-complete problem, to the above expectation problem. Given a choice of two different vtrees, we can construct an RC and a PSDD in time polynomial in the size of a CNF formula, such that the formula’s model count can be computed from the expectation of the RC w.r.t. the PSDD. We refer to Appendix A.3 for more details.

So far, we have focused our analysis on RCs, the analogue of LCs for regression. One would hope that the efficient computations of EC could be carried over to LCs to compute the expected predictions of classifiers. However, the application of the sigmoid function to the output of the regressor $g$, even when $g$ shares the same vtree as $p$, makes the problem intractable, as our next theorem shows.

###### Theorem 3.

Computing the expectation of an LC $f$ w.r.t. a PSDD distribution $p$ is NP-hard, even if $f$ and $p$ share the same vtree.

### 4.3 Approximating expectations of classifiers

Theorem 3 leaves us with no hope of computing exact expected predictions in a tractable way, even for pairs of generative and discriminative circuits conforming to the same vtree. Nevertheless, we can leverage the ability to efficiently compute the moments of $g$ to efficiently approximate the expectation of $\gamma(g)$, where $\gamma$ is any differentiable non-linear function, including the sigmoid $\sigma$. Using a Taylor series expansion around a point $\alpha$, we define the following $d$-order approximation:

$$\sum_{i=0}^{d} \frac{\gamma^{(i)}(\alpha)}{i!}\; \mathbb{E}_{\mathbf{x} \sim p}\big[(g(\mathbf{x}) - \alpha)^{i}\big]$$

where each term $\mathbb{E}_{\mathbf{x} \sim p}\big[(g(\mathbf{x}) - \alpha)^{i}\big]$ expands, by the binomial theorem, into the moments $M_1(g, p), \dots, M_i(g, p)$ delivered by MC. See Appendix A.5 for a detailed derivation and more intuition behind this approximation.
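A sketch of this approximation on toy numbers of our own: we expand the sigmoid around $\alpha = M_1(g, p)$, so the first-order term vanishes, and compare against the exact expectation computed by enumeration. The full tables `p` and `g` stand in for the circuits:

```python
import math

p = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.3}   # toy distribution
g = {(0, 0): -2.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 3.0}  # toy regressor outputs

sig = lambda t: 1.0 / (1.0 + math.exp(-t))

def central_moment(i, alpha):
    # E[(g(x) - alpha)^i], the only quantities the approximation needs
    return sum(p[x] * (g[x] - alpha) ** i for x in p)

alpha = central_moment(1, 0.0)               # expansion point: E[g]
s = sig(alpha)
# sigmoid derivatives at alpha: s' = s(1-s), s'' = s(1-s)(1-2s)
derivs = [s, s * (1 - s), s * (1 - s) * (1 - 2 * s)]

approx = sum(derivs[i] / math.factorial(i) * central_moment(i, alpha)
             for i in range(3))              # second-order approximation
exact = sum(p[x] * sig(g[x]) for x in p)     # brute-force ground truth
error = abs(approx - exact)
```

In the circuit setting, the enumeration in `central_moment` is replaced by MC, so only the moments, never the exponential sum, are ever materialized.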

## 5 Expected Prediction in action

In this section, we empirically evaluate the usefulness and effectiveness of computing the expected predictions of our discriminative circuits with respect to generative ones.^4 First, we tackle the challenging task of making predictions in the presence of missing values at test time, for both regression and classification.^5 Additionally, we show how our framework can be used to reason about the behaviour of predictive models. We employ it to check for biases in their predictions, and to search for interesting patterns in the predictions associated with sub-populations of the data in the context of exploratory data analysis.

^4 Our implementation of the algorithms and experiments is available at https://github.com/UCLA-StarAI/mc2.

^5 In the case of classification, we use the Taylor expansion approximation discussed in Section 4.3.

### 5.1 Reasoning with missing values: an application

Traditionally, prediction with missing values has been addressed by imputation, which substitutes missing values with presumably reasonable alternatives such as mean or median, estimated from training data Schafer (1999). As these imputation methods are typically invariant to the classifier or regression model of interest Little and Rubin (2019), the notion of expected predictions has recently been proposed to handle missingness by reasoning about the model of interest Khosravi et al. (2019). Formally, we are interested in computing

$$\mathbb{E}_{\mathbf{x}^m \sim p(\mathbf{X}^m \mid \mathbf{x}^o)}\big[f(\mathbf{x}^m, \mathbf{x}^o)\big] \qquad (5)$$

where $\mathbf{x}^m$ (resp. $\mathbf{x}^o$) denotes the configuration of a sample that is missing (resp. observed) at test time. In the case of regression, we can exactly compute Eq. 5 for a pair of generative and discriminative circuits sharing the same vtree with our proposed algorithm, after observing that

$$\mathbb{E}_{\mathbf{x}^m \sim p(\mathbf{X}^m \mid \mathbf{x}^o)}\big[f(\mathbf{x}^m, \mathbf{x}^o)\big] = \frac{1}{p(\mathbf{x}^o)}\; \mathbb{E}_{\mathbf{x}^m \sim p_{\mathbf{x}^o}(\mathbf{X}^m)}\big[f(\mathbf{x}^m, \mathbf{x}^o)\big] \qquad (6)$$

where $p_{\mathbf{x}^o}$ is the unnormalized distribution encoded by the generative circuit configured for the evidence $\mathbf{x}^o$. That is, the sub-circuits depending on the variables in $\mathbf{X}^o$ have been fixed according to the input $\mathbf{x}^o$. This configuration, as well as computing the marginal $p(\mathbf{x}^o)$, can be done efficiently in time linear in the size of the PSDD Darwiche (2009).
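Eqs. 5 and 6 can be illustrated by brute-force enumeration over a toy joint distribution (numbers are our own); a circuit implementation would replace the explicit sums with linear-time passes over the PSDD:

```python
# Toy joint over (X0, X1) and a toy regressor; X0 is observed, X1 is missing.
p = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.3}
f = lambda x: 2.0 * x[0] - 1.0 * x[1] + 0.5

x_obs = {0: 1}                      # evidence: X0 = 1

def completions(obs):
    # all complete configurations consistent with the observed values
    for x in p:
        if all(x[i] == v for i, v in obs.items()):
            yield x

# Eq. (6): sum the unnormalized terms p(x^m, x^o) f(x^m, x^o),
# then renormalize by the marginal p(x^o).
p_obs = sum(p[x] for x in completions(x_obs))
expected = sum(p[x] * f(x) for x in completions(x_obs)) / p_obs
```

Here `expected` is the expected prediction of Eq. 5; the normalizer `p_obs` is exactly the marginal a PSDD computes in one pass.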

To demonstrate the generality of our method, we construct a six-dataset testing suite: four common regression benchmarks from several domains Khiari et al. (2018), and two classification datasets, MNIST and FASHION Yann et al. (2009); Xiao et al. (2017). We compare our method with classical imputation techniques, such as standard mean and median imputation, and with more sophisticated (and computationally intensive) techniques such as multiple imputation by chained equations (MICE) Azur et al. (2011). Moreover, we adopt a natural and yet strong baseline: imputing the missing values by the most probable explanation (MPE) Darwiche (2009) via the generative circuit $p$. Note that MPE inference acts as an imputation: it returns the mode of the input feature distribution, while EC conveys a more global statistic of the distribution of the outputs of the predictive model.

To enforce the discriminative-generative pair of circuits to share the same vtree, we first generate a fixed random and balanced vtree and use it to guide the respective parameter and structure learning algorithms of LCs and PSDDs. On image data, however, we exploit the already learned and publicly available structure of Liang and Van den Broeck (2019), which scores 99.4% accuracy on MNIST, competitive with much larger deep models; for it, we learn a PSDD adopting the same vtree. For RCs, we adopt the parameter and structure learning of LCs Liang and Van den Broeck (2019), substituting the logistic regression objective with a ridge regression one during optimization. For structure learning of both LCs and RCs, we run up to 100 iterations while monitoring the loss on a held-out set. For PSDDs, we employ the combined parameter and structure learning of Liang et al. (2017) with default parameters, running it for up to 1000 iterations until no significant improvement is seen on a held-out set.

Figure 2 shows our method outperforming the other regression baselines. This can be explained by the fact that it computes the exact expectation, while the other techniques make restrictive assumptions in order to approximate it. Mean and median imputation effectively assume that the features are independent; MICE^6 assumes a fixed dependence formula between the features; and, as already stated, MPE only considers the highest-probability term in the expansion of the expectation.

^6 On the elevator dataset, we report MICE results only up to 30% missing features, as this imputation method is computationally heavy and required more than 10 hours to complete.
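A toy construction of our own makes the gap over mean imputation concrete: when features are correlated, the unconditional mean of a missing feature discards what the observed features reveal about it, while the exact expectation conditions on them:

```python
# Strongly correlated binary features; the prediction is driven by X1.
p = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
f = lambda x0, x1: 3.0 * x1

# Observe X0 = 1; X1 is missing.
# Mean imputation plugs in the *unconditional* mean of X1.
mean_x1 = sum(pr * x1 for (x0, x1), pr in p.items())
mean_imputed = f(1, mean_x1)

# Exact expected prediction conditions X1 on the observed X0 = 1.
p_obs = sum(pr for (x0, _), pr in p.items() if x0 == 1)
expected = sum(pr * f(x0, x1) for (x0, x1), pr in p.items() if x0 == 1) / p_obs
```

Since observing $X_0 = 1$ makes $X_1 = 1$ far more likely here, the conditional expectation lands well above the mean-imputed prediction.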

Additionally, as we see in Figure 3, our approximation method for classification, using just the first-order expansion, is able to outperform the predictions of the other competitors. This suggests that our method is effective in approximating the true expected values.

These experiments agree with the observations of Khosravi et al. (2019) that, given missing data, probabilistically reasoning about the outcome of a classifier by taking expectations can generally outperform imputation techniques. Our advantage clearly comes from the PSDD learning a better density estimator of the data distribution, instead of relying on fixed prior assumptions about the features. An additional demonstration of this comes from the good performance of MPE on both datasets, which can likewise be credited to the PSDD learning a good distribution over the features.

### 5.2 Reasoning about predictive models for exploratory data analysis

We now showcase an example of how our framework can be utilized for exploratory data analysis while reasoning about the behavior of a given predictive model. Suppose an insurance company has hired us to analyze both their data and the predictions of their regression model, providing the RC and PSDD circuits learned on the Insurance dataset. This dataset lists the yearly health insurance cost of individuals living in the US with features such as age, smoking habits, and location. Our task is to examine the behavior of the predictions, such as whether they are biased by some sensitive attributes or whether there exist interesting patterns across sub-populations of the data.

We might start by asking: “how different are the insurance costs between smokers and non-smokers?”, which can easily be computed as

$$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{X} \mid \text{smoker})}\big[f(\mathbf{x})\big] \;-\; \mathbb{E}_{\mathbf{x} \sim p(\mathbf{X} \mid \text{non-smoker})}\big[f(\mathbf{x})\big] \qquad (7)$$

by applying the same conditioning as in Eq.s 5 and 6. We can also ask: “is the predictive model biased by gender?” To answer that, it suffices to compute:

$$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{X} \mid \text{male})}\big[f(\mathbf{x})\big] \;-\; \mathbb{E}_{\mathbf{x} \sim p(\mathbf{X} \mid \text{female})}\big[f(\mathbf{x})\big] \qquad (8)$$

As expected, being a smoker affects the health insurance costs much more than being male or female. If it were the opposite, we would conclude that the model may be unfair or misbehaving.

In addition to examining the effect of a single feature, we may study the model on a smaller sub-population by conditioning the distribution on multiple features. For instance, suppose the insurance company is interested in expanding and, as part of its marketing plan, wants to know the effect of an individual’s region, e.g., southeast ($SE$) and southwest ($SW$), for the sub-population of female ($F$) smokers ($S$) with one child ($C$). By computing the following quantities, we can discover that the difference of their averages is relevant, but much more relevant is that of their standard deviations, indicating significantly different treatment between regions:

$$M_1\big(f,\, p(\cdot \mid SE, F, S, C)\big) \quad \text{vs.} \quad M_1\big(f,\, p(\cdot \mid SW, F, S, C)\big) \qquad (9)$$

$$\sqrt{M_2\big(f,\, p(\cdot \mid SE, F, S, C)\big) - M_1^2\big(f,\, p(\cdot \mid SE, F, S, C)\big)} \quad \text{vs.} \quad \sqrt{M_2\big(f,\, p(\cdot \mid SW, F, S, C)\big) - M_1^2\big(f,\, p(\cdot \mid SW, F, S, C)\big)} \qquad (10)$$

However, one may ask why we do not estimate these values directly from the dataset. The main issue in doing so is that, as we condition on more features, few if any matching samples remain in the data. For example, only 4 and 3 samples match the criteria of the last two queries, respectively. Furthermore, it is not uncommon for the data to be unavailable due to sensitivity or privacy concerns, with only the models available. For instance, two insurance agencies in different regions might want to partner without yet sharing their data.
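Queries like Eq.s 9 and 10 reduce to first and second moments under a conditioned distribution. The sketch below computes them by enumeration over a toy model entirely of our own making (the feature names and numbers are illustrative, not the Insurance data); the circuit pair would answer the same queries without enumerating:

```python
import math

# Toy joint over (region, smoker) and a toy cost model.
p = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}
f = lambda region, smoker: 1000.0 + 5000.0 * smoker + 300.0 * region

def subpop_stats(region):
    # condition on the sub-population, then take M1 and sqrt(M2 - M1^2)
    cond = {x: pr for x, pr in p.items() if x[0] == region}
    z = sum(cond.values())                       # p(region)
    m1 = sum(pr * f(*x) for x, pr in cond.items()) / z
    m2 = sum(pr * f(*x) ** 2 for x, pr in cond.items()) / z
    return m1, math.sqrt(m2 - m1 ** 2)

mean0, std0 = subpop_stats(0)
mean1, std1 = subpop_stats(1)
```

Both the means and the standard deviations come out of the same two moments, which is exactly what MC computes tractably for the circuit pair.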

The expected prediction framework with probabilistic circuits allows us to efficiently compute these queries with interesting applications in explainability and fairness. We leave the more rigorous exploration of their applications for future work.

## 6 Related Work

Using expected predictions to handle missing values was introduced in Khosravi et al. (2019): given a logistic regression model, they learn a conforming naive Bayes model and then compute the expected prediction using only the learned naive Bayes model. In contrast, we take the expected prediction of one model with respect to a second, distinct model; moreover, probabilistic circuits are much more expressive models. Imputation is a common and well-studied way to handle missing features. For more details and a history of these techniques, we refer the reader to Buuren (2018); Little and Rubin (2019).

Probabilistic circuits enable a wide range of tractable operations. Given the two circuits, our expected prediction algorithm operates on pairs of nodes in the two circuits corresponding to the same vtree node and hence has a quadratic run-time. Other applications operate on similar pairs of nodes, such as multiplying the distributions of two PSDDs Shen et al. (2016), and computing the probability of a logical formula, given as an SDD, w.r.t. the joint distribution defined by a PSDD Choi et al. (2015b).

## 7 Conclusion

In this paper, we investigated under which model assumptions it is tractable to compute expectations of certain discriminative models. We proved how, for regression, pairing a discriminative circuit with a generative one sharing the same vtree structure allows one to compute not only expectations but also arbitrarily high-order moments in polytime. Furthermore, we characterized when the task is otherwise hard, e.g., for classification, when a non-decomposable, non-linear function is introduced. For this scenario, we devised an approximate computation that leverages the aforementioned efficient computation of the moments of regressors. Finally, we showcased how the expected prediction framework can help a data analyst reason about a predictive model’s behaviour under different sub-populations. This opens up several interesting research venues, from applications like reasoning about missing values and performing feature selection, to scenarios where exact and approximate computations of expected predictions can be combined.

## Acknowledgements

This work is partially supported by NSF grants #IIS-1633857, #CCF-1837129, DARPA XAI grant #N66001-17-2-4032, NEC Research, and gifts from Intel and Facebook Research.

expecta bos olim herba.

## References

- Azur et al. [2011] M. J. Azur, E. A. Stuart, C. Frangakis, and P. J. Leaf. Multiple imputation by chained equations: what is it and how does it work? International journal of methods in psychiatric research, 20(1):40–49, 2011.
- Buuren [2018] S. v. Buuren. Flexible imputation of missing data. CRC Press, 2018. URL https://stefvanbuuren.name/fimd/.
- Chang et al. [2019] C.-H. Chang, E. Creager, A. Goldenberg, and D. Duvenaud. Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, 2019.
- Choi et al. [2012] A. Choi, Y. Xue, and A. Darwiche. Same-decision probability: A confidence measure for threshold-based decisions. International Journal of Approximate Reasoning, 53(9):1415–1428, 2012.
- Choi et al. [2015a] A. Choi, G. Van Den Broeck, and A. Darwiche. Tractable learning for structured probability spaces: A case study in learning preference distributions. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 2861–2868. AAAI Press, 2015a. ISBN 978-1-57735-738-4. URL http://dl.acm.org/citation.cfm?id=2832581.2832649.
- Choi et al. [2015b] A. Choi, G. Van den Broeck, and A. Darwiche. Tractable learning for structured probability spaces: A case study in learning preference distributions. In Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 2015b.
- Choi et al. [2017] Y. Choi, A. Darwiche, and G. Van den Broeck. Optimal feature selection for decision robustness in Bayesian networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Aug 2017.
- Darwiche [2003] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 2003.
- Darwiche [2009] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.
- Darwiche and Marquis [2002] A. Darwiche and P. Marquis. A knowledge compilation map. Journal of Artificial Intelligence Research, 17:229–264, 2002.
- Khiari et al. [2018] J. Khiari, L. Moreira-Matias, A. Shaker, B. Ženko, and S. Džeroski. Metabags: Bagged meta-decision trees for regression. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 637–652. Springer, 2018.
- Khosravi et al. [2019] P. Khosravi, Y. Liang, Y. Choi, and G. Van den Broeck. What to expect of classifiers? Reasoning about logistic regression with missing features. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019.
- Kisa et al. [2014] D. Kisa, G. Van den Broeck, A. Choi, and A. Darwiche. Probabilistic sentential decision diagrams. In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), July 2014.
- Krause and Guestrin [2009] A. Krause and C. Guestrin. Optimal value of information in graphical models. Journal of Artificial Intelligence Research, 35:557–591, 2009.
- Liang and Van den Broeck [2019] Y. Liang and G. Van den Broeck. Learning logistic circuits. In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI), 2019.
- Liang et al. [2017] Y. Liang, J. Bekker, and G. Van den Broeck. Learning the structure of probabilistic sentential decision diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
- Little and Rubin [2019] R. J. Little and D. B. Rubin. Statistical analysis with missing data, volume 793. Wiley, 2019.
- Lundberg and Lee [2017] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
- Nash et al. [1994] W. J. Nash, T. L. Sellers, S. R. Talbot, A. J. Cawthorn, and W. B. Ford. The population biology of abalone (Haliotis species) in Tasmania. I. Blacklip abalone (H. rubra) from the north coast and islands of Bass Strait. Sea Fisheries Division, Technical Report, 48, 1994.
- Ribeiro et al. [2016] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.
- Roth [1996] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1–2):273–302, 1996.
- Rozenholc et al. [2010] Y. Rozenholc, T. Mildenberger, and U. Gather. Combining regular and irregular histograms by penalized likelihood. Computational Statistics & Data Analysis, 54(12):3313–3323, 2010.
- Schafer [1999] J. L. Schafer. Multiple imputation: a primer. Statistical methods in medical research, 8(1):3–15, 1999.
- Shen et al. [2016] Y. Shen, A. Choi, and A. Darwiche. Tractable operations for arithmetic circuits of probabilistic models. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3936–3944. Curran Associates, Inc., 2016.
- Shen et al. [2017] Y. Shen, A. Choi, and A. Darwiche. A tractable probabilistic model for subset selection. In UAI, 2017.
- Shen et al. [2018] Y. Shen, A. Choi, and A. Darwiche. Conditional PSDDs: Modeling and learning with modular knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Vergari et al. [2016] A. Vergari, N. Di Mauro, and F. Esposito. Visualizing and understanding sum-product networks. arXiv preprint, 2016. URL https://arxiv.org/abs/1608.08266.
- Vergari et al. [2018] A. Vergari, R. Peharz, N. Di Mauro, A. Molina, K. Kersting, and F. Esposito. Sum-product autoencoding: Encoding and decoding representations using sum-product networks. In AAAI, 2018.
- Xiao et al. [2017] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
- Yann et al. [2009] L. Yann, C. Corinna, and C. J. Burges. The MNIST database of handwritten digits, 2009.
- Yu et al. [2009] S. Yu, B. Krishnapuram, R. Rosales, and R. B. Rao. Active sensing. In Artificial Intelligence and Statistics, pages 639–646, 2009.
- Zafar et al. [2015] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015.
- Zafar et al. [2017] M. B. Zafar, I. Valera, M. Rodriguez, K. Gummadi, and A. Weller. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems, pages 229–239, 2017.

## Supplements “On Tractable Computation of Expected Predictions”

## Appendix A Proofs

### A.1 Proofs of Propositions 1 and 3

We will first prove Proposition 3, from which Proposition 1 directly follows. For a PSDD OR node and an RC OR node,

(11)

(12)

Equation 11 follows from determinism, as at most one child will have a non-zero probability for any given input. Equation 12 introduces a shorthand that involves a slight abuse of notation. This concludes the proof of Proposition 3.
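As an illustrative sanity check (not part of the formal proof), the role of determinism can be verified numerically: when a distribution and a regressor split on the same support condition, the cross terms between mismatched children vanish, so the expectation reduces to a sum over matching child pairs. The distributions, weights, and sub-regressors below are hypothetical toy choices.

```python
from itertools import product

# Toy deterministic OR nodes: both the distribution p and the regressor g
# split on the same support condition (x1 = 1 vs. x1 = 0), mimicking a PSDD
# and a regression circuit that share a vtree. All numbers are hypothetical.
X = list(product([0, 1], repeat=2))

# Child distributions with disjoint supports (determinism).
p0 = {x: (0.3 if x == (1, 0) else 0.7 if x == (1, 1) else 0.0) for x in X}
p1 = {x: (0.5 if x[0] == 0 else 0.0) for x in X}
theta = [0.6, 0.4]   # PSDD OR-node weights
phi = [2.0, -1.0]    # RC OR-node weights

def g0(x):           # sub-regressor active on the x1 = 1 branch
    return x[1] + 1.0

def g1(x):           # sub-regressor active on the x1 = 0 branch
    return 3.0 * x[1]

def g(x):
    return phi[0] * g0(x) if x[0] == 1 else phi[1] * g1(x)

# Brute-force expectation of g under the mixture p ...
p = {x: theta[0] * p0[x] + theta[1] * p1[x] for x in X}
e_full = sum(p[x] * g(x) for x in X)

# ... equals the sum over *matching* children only: cross terms such as
# "p0 paired with g1" vanish because the child supports are disjoint.
e_diag = theta[0] * phi[0] * sum(p0[x] * g0(x) for x in X) \
       + theta[1] * phi[1] * sum(p1[x] * g1(x) for x in X)
assert abs(e_full - e_diag) < 1e-12
```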

We obtain Proposition 1 by applying the above result with:

### A.2 Proofs of Propositions 2 and 4

### A.3 Proof of Theorem 2

The proof is by reduction from the model counting problem (#SAT) which is known to be #P-hard.

Given a CNF formula, let us construct a PSDD and a regression circuit as follows. For every variable appearing in a clause, introduce an auxiliary copy variable. Then:

Here, the notation denotes the literal of the variable (i.e., positive or negated) in the given clause. Thus, the constructed formula is the same CNF formula as the original, except that each variable appears as several different copies, one per clause containing it. An additional formula ensures that the copied variables are all equivalent. Hence, the model count of the constructed formula must equal that of the original.
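This count-preservation step can be checked by brute force on a small example. The sketch below uses a hypothetical two-clause CNF and enumerates assignments to the per-clause copies, keeping only those that satisfy both the copied clauses and the equivalence constraints.

```python
from itertools import product

# Tiny hypothetical CNF over x1, x2: (x1 or x2) and (not x1 or x2).
# Each clause is a list of (variable index, sign) literals.
clauses = [[(0, True), (1, True)], [(0, False), (1, True)]]
n = 2

def mc_original():
    # Model count of the original CNF by exhaustive enumeration.
    return sum(all(any(x[v] == s for v, s in c) for c in clauses)
               for x in product([False, True], repeat=n))

# Copied formula: variable v in clause c becomes the copy (v, c); the
# equivalence constraints force all copies of v to agree.
copies = [(v, c) for c, clause in enumerate(clauses) for v, _ in clause]

def mc_copied():
    count = 0
    for bits in product([False, True], repeat=len(copies)):
        a = dict(zip(copies, bits))
        # All copies of the same original variable must be equal.
        equiv = all(a[(v, c)] == a[(v2, c2)]
                    for (v, c) in copies for (v2, c2) in copies if v == v2)
        # Each clause is evaluated on its own copies.
        sat = all(any(a[(v, c)] == s for v, s in clause)
                  for c, clause in enumerate(clauses))
        if equiv and sat:
            count += 1
    return count

# The copy construction preserves the model count.
assert mc_original() == mc_copied()
```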

Consider a right-linear vtree in which the copies of each variable appear consecutively. The PSDD sub-circuit involving the copies of a single variable has exactly two models and size linear in the number of copies. There are as many such sub-circuits as there are variables in the original formula, and they can be chained together directly. The key insight in doing so is that sub-circuits corresponding to different variables are independent of one another. Then, we can construct in polytime a PSDD circuit structure whose logical formula represents the equivalence constraints. In a single top-down pass, we can parameterize the PSDD such that it represents a uniform distribution, with each model assigned equal probability.

Next, consider a right-linear vtree with the variables appearing in a different order, grouped by clause. Then, we can construct a logical circuit that represents the copied CNF in polynomial time, as each variable appears exactly once in the formula. That is, each clause has a PSDD sub-circuit of linear size (in the number of literals appearing in the clause), and the size of their conjunction is simply the sum of the sizes of these sub-circuits. We can parameterize it as a regression circuit by assigning weight 0 to all inputs of OR gates and adding a single OR gate, with weight 1, on top of the root node. This regression circuit then outputs 1 if and only if the input assignment satisfies the copied CNF.

Then the expectation of the regression circuit w.r.t. the PSDD (which does not share the same vtree) can be used to compute the model count as follows: the PSDD is uniform over its models, which correspond one-to-one to assignments of the original variables, and the regression circuit outputs 1 exactly on satisfying assignments, so the expectation equals the model count divided by the total number of assignments; multiplying by that number recovers the model count.

Thus, #SAT can be reduced to the problem of computing expectations of a regression circuit w.r.t. a PSDD that does not share the same vtree. ∎
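This final scaling step admits a simple numerical illustration: under a uniform distribution over all assignments, the expectation of a satisfiability indicator, scaled by the number of assignments, equals the model count. The two-clause CNF below is a hypothetical toy instance, and the indicator function stands in for the regression circuit of the proof.

```python
from itertools import product

# Hypothetical toy CNF over x1, x2: (x1 or x2) and (not x1 or x2).
clauses = [[(0, True), (1, True)], [(0, False), (1, True)]]
n = 2

def sat(x):
    # Stands in for the regression circuit: returns 1.0 iff x satisfies the CNF.
    return float(all(any(x[v] == s for v, s in c) for c in clauses))

# Uniform distribution over all 2^n assignments, playing the role of the
# uniform PSDD constructed in the proof.
assignments = list(product([False, True], repeat=n))
expectation = sum(sat(x) for x in assignments) / 2 ** n

# Scaling the expectation by 2^n recovers the model count (#SAT).
model_count = sum(sat(x) for x in assignments)
assert model_count == 2 ** n * expectation
```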