# Sum-Product Networks for Hybrid Domains

###### Abstract

While all kinds of mixed data (from personal data, over panel and scientific data, to public and commercial data) are collected and stored, building probabilistic graphical models for these hybrid domains becomes more difficult. Users spend significant amounts of time identifying the parametric form of the random variables (Gaussian, Poisson, Logit, etc.) involved and learning the mixed models. To make this difficult task easier, we propose the first trainable probabilistic deep architecture for hybrid domains that features tractable queries. It is based on Sum-Product Networks (SPNs) with piecewise polynomial leaf distributions together with novel nonparametric decomposition and conditioning steps using the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. This relieves the user from deciding a priori the parametric form of the random variables, yet is still expressive enough to effectively approximate any continuous distribution and permits efficient learning and inference. Our empirical evidence shows that the architecture, called Mixed SPNs, can indeed capture complex distributions across a wide range of hybrid domains.

CS Department, TU Dortmund, Germany

Antonio Vergari * (first.last@uniba.it), CS Department, University of Bari, Italy

Nicola Di Mauro (first.last@uniba.it), CS Department, University of Bari, Italy

Sriraam Natarajan (natarasr@indiana.edu), CS Department, Indiana University, USA

Floriana Esposito (first.last@uniba.it), CS Department, University of Bari, Italy

Kristian Kersting (last@cs.tu-darmstadt.de), CS Dept. and Centre for CogSci, TU Darmstadt, Germany

* These authors contributed equally to this work

## 1 Introduction

Machine learning has achieved considerable successes in recent years and an ever-growing number of disciplines rely on it. Data is now ubiquitous, and there is great value in understanding the data, building probabilistic models and making predictions with them. However, in most cases, this success crucially relies on data scientists to posit the right parametric form of the probabilistic model underlying the data, to select a good algorithm to fit it to their data, and finally to perform inference on it.

These tasks can be quite challenging even for experts and often go beyond non-experts' capabilities, specifically in hybrid domains, which consist of mixed (continuous, discrete and/or categorical) statistical types. Building a probabilistic model that is both expressive enough to capture complex dependencies among random variables of different types and amenable to effective learning and efficient inference is still an open problem.

More precisely, most existing graphical models for hybrid domains—also called mixed models—are limited to particular combinations of variables of parametric forms such as the Gaussian–Ising mixed model Lauritzen and Wermuth (1989), where there are Gaussian and multinomial random variables, and the continuous variables are conditioned on all configurations of the discrete variables. Unfortunately, inference in this Gaussian-Ising mixed graphical model scales exponentially with the number of discrete variables, and only recently, 3-way dependencies have been realized Cheng et al. (2014). Therefore it is not surprising that hybrid Bayesian networks (HBNs) have restricted their attention to simpler parametric forms for the conditional distributions such as conditional linear Gaussian models Heckerman and Geiger (1995).

While extensions based on copulas aim to provide more flexibility Elidan (2010), selecting the best parametric copula distribution for each application requires a significant engineering effort. Probably the most recent approach is Manichean graphical models Yang et al. (2014), and we refer to that paper for an excellent recent overview of mixed graphical models. Manichean models (named after the philosophy that loosely places elements into one of two types) specify that each of the conditional distributions is a member of a possibly different univariate exponential family. Although indeed more flexible than Gaussian-Ising mixed models, Manichean models are still demanding, in particular when it comes to inference. Alternatively, one may make a piecewise approximation to continuous distributions Shenoy and West (2011). In their simplest form, piecewise constant functions are often adopted in the form of histograms or staircase functions, and more expressive approximations comprise mixtures of truncated polynomials Langseth et al. (2012) and exponentials Moral et al. (2001). This has resulted in a number of novel inference approaches for hybrid domains Sanner and Abbasnejad (2012); Belle et al. (2015a, b); Morettin et al. (2017).

Although these non-parametric models are expressive, learning them has either not been considered or does not scale. To overcome the difficulty of mixed probabilistic graphical modeling and inspired by the successes of deep models, we introduce Mixed Sum-Product Networks (MSPNs). They are a general class of mixed probabilistic models that, by combining Sum-Product Networks Poon and Domingos (2011) and piecewise polynomials, allow for a large range of exact and tractable inference without making distributional assumptions. Learning MSPNs from data, however, requires novel decomposition and conditioning steps for Sum-Product Networks (SPNs) tailored towards nonparametric distributions. Providing them based on the Rényi Maximum Correlation Coefficient Lopez-Paz et al. (2013), the first application of it to learning sum-product networks, via a series of variable transformations is our main technical contribution. This then naturally results in the first automated tool for learning multivariate distributions over hybrid domains without requiring users to decide the parametric form of random variables or their dependencies, yet enabling them to answer complex probabilistic queries efficiently on tasks previously infeasible for classical mixed models. We proceed as follows. We start off by reviewing SPNs. Afterwards, we introduce MSPNs and show how to learn tree-structured MSPNs from data using the Rényi Maximum Correlation Coefficient. Before concluding, we present our experimental evaluation.

## 2 Sum-Product Networks (SPNs)

Recent years have seen a significant interest in tractable probabilistic representations such as Arithmetic Circuits (ACs), see Choi and Darwiche (2017) for a discussion. In particular, SPNs, an instance of ACs, are deep probabilistic models that can represent high-treewidth models Zhao et al. (2015) and facilitate exact inference for a range of queries in time polynomial in the network size Poon and Domingos (2011); Bekker et al. (2015).

Definition of SPNs: Formally, an SPN is a rooted directed acyclic graph, comprising sum, product and leaf nodes. The scope of an SPN is the set of random variables appearing in the network. An SPN can be defined recursively as follows: (1) a tractable univariate distribution is an SPN; (2) a product of SPNs defined over different scopes is an SPN; and (3) a convex combination of SPNs over the same scope is an SPN. Thus, a product node in an SPN represents a factorization over independent distributions defined over different random variables, while a sum node stands for a mixture of distributions defined over the same variables. From this definition, it follows that the joint distribution modeled by such an SPN is a valid probability distribution, i.e., each complete and partial evidence inference query produces a consistent probability value Poon and Domingos (2011); Peharz et al. (2015). This also implies that we can construct multivariate distributions from simpler univariate ones. Furthermore, any node in the network could be replaced by any tractable multivariate distribution over the same scope, still yielding a valid SPN.

Tractable Inference in SPNs: To answer probabilistic queries in an SPN, we evaluate the nodes starting at the leaves. Given some evidence, the probability output of querying leaf distributions is propagated bottom-up. For product nodes, the values of the children nodes are multiplied and propagated to their parents. For sum nodes, instead, we sum the weighted values of the children nodes. The value at the root indicates the probability of the asked query. To compute marginals, i.e., the probability of partial configurations, we set the probability at the leaves of the variables to be marginalized out to $1$ and then proceed as before. Conditional probabilities can then be computed as the ratio of partial configurations. To compute MPE states, we replace sum nodes by max nodes and evaluate the graph first with a bottom-up pass, passing along the weighted maximum value instead of the weighted sum. Finally, in a top-down pass, we select the paths that lead to the maximum value, finding approximate MPE states Poon and Domingos (2011). All these operations traverse the tree at most twice and therefore can be achieved in linear time w.r.t. the size of the SPN.
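The bottom-up evaluation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class names `Leaf`, `Product` and `Sum`, the `bernoulli` helper and the toy network are all assumptions made for the example. Marginalization is expressed by passing `None` for a variable, which makes the corresponding leaf return 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical evidence format: None marks a variable to marginalize out.
Evidence = Dict[str, Optional[float]]


@dataclass
class Leaf:
    var: str
    density: Callable[[float], float]  # tractable univariate distribution

    def value(self, ev: Evidence) -> float:
        x = ev.get(self.var)
        # A marginalized variable contributes 1, since its density integrates to one.
        return 1.0 if x is None else self.density(x)


@dataclass
class Product:
    children: list  # children are defined over disjoint scopes

    def value(self, ev: Evidence) -> float:
        out = 1.0
        for child in self.children:
            out *= child.value(ev)
        return out


@dataclass
class Sum:
    weighted: List[Tuple[float, object]]  # (weight, child) pairs, weights sum to 1

    def value(self, ev: Evidence) -> float:
        return sum(w * child.value(ev) for w, child in self.weighted)


def bernoulli(p: float) -> Callable[[float], float]:
    """Toy leaf distribution over {0, 1}."""
    return lambda x: p if x == 1 else 1.0 - p


# A mixture of two fully factorized components over binary variables A and B.
spn = Sum([
    (0.3, Product([Leaf("A", bernoulli(0.9)), Leaf("B", bernoulli(0.2))])),
    (0.7, Product([Leaf("A", bernoulli(0.1)), Leaf("B", bernoulli(0.6))])),
])

joint = spn.value({"A": 1, "B": 1})        # P(A=1, B=1)
marginal = spn.value({"A": 1, "B": None})  # P(A=1), with B marginalized out
conditional = joint / marginal             # P(B=1 | A=1)
```

Each query is a single pass over the network, which is where the linear-time claim comes from. An MPE pass would replace the `sum` in `Sum.value` by a `max` and add a top-down traversal, which is omitted here for brevity.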

Learning SPNs: While it is possible to craft a valid SPN structure by hand, doing so would require domain knowledge and weight learning afterwards Poon and Domingos (2011). Here, we focus on a top-down approach Lowd and Domingos (2008); Gens and Domingos (2013) that directly learns both the structure and weights of (tree) SPNs at once. It uses three steps: (1) base case, (2) decomposition and (3) conditioning. In the base case, if only one variable remains, the algorithm learns a univariate distribution and terminates. In the decomposition step, it tries to partition the variables into independent components, such that the joint distribution factorizes into the product of the distributions over the components, and recurses on each component, inducing a product node. If neither the base case nor the decomposition step is applicable, then the training samples are partitioned into clusters (conditioning), inducing a sum node, and the algorithm recurses on each cluster.

This scheme for learning tree SPNs has been instantiated for several well-known distributions with parametric forms. Conditioning for Gaussians can be realized using hard clustering with EM or K-means Gens and Domingos (2013); Rooshenas and Lowd (2014). For Poissons, mixtures of Poisson Dependency Networks have proven successful Molina et al. (2017). For the decomposition step, one typically employs pairwise independence tests with some associated independence score. For categorical variables, Gens and Domingos (2013) proposed to use the G-test, and Rooshenas and Lowd (2014) a pairwise mutual information test. For variables of the generalized linear model family, Molina et al. (2017) proposed the use of parameter instability tests based on generalized M-fluctuation processes. Then, one creates an undirected graph in which there is an edge between two random variables if the value of their score passes a threshold of significance. That is, the decomposition step amounts to partitioning the graph into its connected components. The decomposition is rejected if there is only a single connected component.
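As an illustration of the decomposition step, the sketch below thresholds a pairwise score and returns the connected components of the induced graph. The function name `independent_components`, the toy scores and the threshold value are hypothetical, and the sketch assumes that a larger score means stronger dependence (for p-value based tests the comparison would be reversed).

```python
from typing import Callable, List, Sequence


def independent_components(
    variables: Sequence[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Partition variables into connected components of the dependency graph."""
    # Add an edge whenever the pairwise dependency score exceeds the threshold.
    adjacency = {v: set() for v in variables}
    for i, a in enumerate(variables):
        for b in variables[i + 1:]:
            if score(a, b) > threshold:
                adjacency[a].add(b)
                adjacency[b].add(a)

    # Depth-first search collects one component per unvisited variable.
    seen, components = set(), []
    for v in variables:
        if v in seen:
            continue
        seen.add(v)
        stack, component = [v], []
        while stack:
            u = stack.pop()
            component.append(u)
            for neighbour in adjacency[u]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Toy symmetric scores; every pair not listed is treated as nearly independent.
toy_scores = {frozenset(("age", "income")): 0.8, frozenset(("color", "shape")): 0.6}
components = independent_components(
    ["age", "income", "color", "shape"],
    lambda a, b: toy_scores.get(frozenset((a, b)), 0.05),
    threshold=0.3,
)
# A single component would mean the decomposition is rejected.
decomposition_accepted = len(components) > 1
```

With more than one component, each component becomes the scope of one child of a product node.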

## 3 Mixed Sum-Product Networks (MSPNs)

Unfortunately, all previous decomposition and conditioning approaches for SPNs are only suitable for multivariate distributions of known parametric form: categorical, binomial, Gaussian and Poisson distributions Poon and Domingos (2011); Vergari et al. (2015); Molina et al. (2017). To model hybrid domains without making parametric assumptions, one has to introduce new conditioning and decomposition approaches tailored towards mixed models.

Rényi Decomposition: We approach the problem of seeking independent subsets of random variables of mixed but unknown types as a dependency discovery problem. Alfred Rényi (1959) argued that a measure of maximum dependence $\rho$ between random variables $X$ and $Y$ should satisfy several fundamental properties, such as symmetry and transformation invariance, and that it should also hold that $\rho(X, Y) = 0$ iff $X$ and $Y$ are statistically independent. He also showed that the Hirschfeld-Gebelein-Rényi (HGR) Maximum Correlation Coefficient due to Gebelein (1941) satisfies all these properties. Recently, Lopez-Paz et al. (2013) provided a practical estimator for the HGR coefficient, the randomized dependency coefficient (RDC). The RDC is appealing for hybrid domains because it can be applied to multivariate, continuous and discrete random variables alike. Also, its running time of $O(n \log n)$, with $n$ being the number of instances, makes it one of the fastest non-linear dependency measures.

The general idea behind the RDC is to look for linear correlations between the representations of two random samples that have undergone a series of non-linear transformations. The two samples are deemed statistically independent iff the transformed samples are linearly uncorrelated. This is the same reasoning behind the adoption of higher space projections for the kernel-trick in classification and the stacking of representations in deep architectures.

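Specifically, consider two random samples $\mathcal{D}_X = \{x_i\}_{i=1}^{n}$ and $\mathcal{D}_Y = \{y_i\}_{i=1}^{n}$ drawn from variables $X$ and $Y$. We decide that $X$ and $Y$ are independent iff $\rho(\mathcal{D}_X, \mathcal{D}_Y) = 0$, where $\rho$ is the RDC.

Instead of operating directly on $\mathcal{D}_X$ and $\mathcal{D}_Y$, and in order to achieve invariance against scaling and shifting data transformations, we first compute their empirical copula transformations Póczos et al. (2012), $C_X$ and respectively $C_Y$, in the following way:

$$C_X = \{\hat{F}_X(x_1), \dots, \hat{F}_X(x_n)\}, \qquad \hat{F}_X(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[x_i \le x], \tag{1}$$

and analogously for $C_Y$.

Then, we apply a random linear projection on the obtained samples to a $k$-dimensional space, finally passing them through a non-linear function $\sigma$. We compute:

$$\phi(C_X) = \sigma\left(w \cdot C_X^{T} + b\right), \qquad (w, b) \sim \mathcal{N}(\mathbf{0}_k, s\,\mathbf{I}_{k \times k}), \tag{2}$$

for the first sample; an equivalent transformation yields $\phi(C_Y)$.

Note that $w \in \mathbb{R}^{k \times 1}$ and $b \in \mathbb{R}^{k \times 1}$, and that random sampling from a zero-mean $k$-dimensional Gaussian is analogous to the use of a Gaussian kernel Rahimi and Recht (2009). We choose $\sigma$ to be the sine function and set $k$ and $s$ following Lopez-Paz et al. (2013), as these choices have proven to be reasonable empirical heuristics. Lastly, we compute the canonical correlations (CCA) $\lambda$ for $\phi(C_X)$ and $\phi(C_Y)$ as the solutions of the following eigenproblem:

$$\begin{pmatrix} \mathbf{0} & \Sigma_{XX}^{-1}\Sigma_{XY} \\ \Sigma_{YY}^{-1}\Sigma_{YX} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \tag{3}$$

where the covariance block matrices involved are $\Sigma_{XY} = \mathrm{cov}(\phi(C_X), \phi(C_Y))$, $\Sigma_{YX} = \Sigma_{XY}^{T}$, $\Sigma_{XX} = \mathrm{cov}(\phi(C_X), \phi(C_X))$ and $\Sigma_{YY} = \mathrm{cov}(\phi(C_Y), \phi(C_Y))$.

In the end, the actual value for the RDC coefficient is the largest canonical correlation coefficient:

$$\rho(\mathcal{D}_X, \mathcal{D}_Y) = \sup_{\alpha, \beta} \mathrm{corr}\left(\alpha^{T}\phi(C_X), \beta^{T}\phi(C_Y)\right) = \lambda_{\max}. \tag{4}$$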

This RDC pipeline goes through a series of data transformations, which constitute the basis of our decomposition procedure, cf. Alg. 1. We note that all the transformations presented so far are easily generalizable to the multivariate case Lopez-Paz et al. (2013). We apply these multivariate versions both when performing conditioning on multivariate samples (see below) and when decomposing categorical random variables. Since Eq. 1 is not well defined for categorical data, we proceed as follows to treat them in the same way as continuous and discrete data. First, we perform a one-hot encoding transformation for each categorical random variable $X$, obtaining a multivariate binary random variable. Then, we apply Eq. 1 to each column of this encoding independently, obtaining a matrix of copula-transformed columns. This way we preserve all the modalities of $X$. Finally, we apply the versions of Eq. 2 and Eq. 3 generalized to the multivariate case.

Note that, while we are looking for the RDC to be zero in case of independent random variables, it is extremely unlikely for this to happen on real data samples. In practice, the thresholding approach on the adjacency graph induced by dependencies (see the previous section) takes care of this for the decomposition step.
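The univariate RDC pipeline (copula transform, random sine features, largest canonical correlation) can be sketched with NumPy as follows. This is an illustrative reading of the pipeline, not the authors' code: the function names, the defaults `k=10` and `s=1.0`, the treatment of `s` as a variance, and the small ridge term added for numerical stability are all assumptions of this sketch. The canonical correlation is computed through whitening and an SVD, which yields the same largest correlation as the eigenproblem formulation.

```python
import numpy as np


def empirical_copula(x):
    """Map each value to its empirical CDF value, i.e. (number of samples <= x) / n."""
    x = np.asarray(x, dtype=float)
    return np.searchsorted(np.sort(x), x, side="right") / len(x)


def random_sine_features(u, k, s, rng):
    """Random affine projection to k dimensions followed by a sine non-linearity."""
    w = rng.normal(0.0, np.sqrt(s), size=k)  # assumption: s is a variance
    b = rng.normal(0.0, np.sqrt(s), size=k)
    return np.sin(np.outer(u, w) + b)  # shape (n, k)


def largest_canonical_correlation(fx, fy, ridge=1e-6):
    """Largest canonical correlation between two feature matrices."""
    fx = fx - fx.mean(axis=0)
    fy = fy - fy.mean(axis=0)
    n = fx.shape[0]
    # The ridge term (an assumption of this sketch) keeps the covariances invertible.
    cxx = fx.T @ fx / n + ridge * np.eye(fx.shape[1])
    cyy = fy.T @ fy / n + ridge * np.eye(fy.shape[1])
    cxy = fx.T @ fy / n

    def inv_sqrt(c):
        vals, vecs = np.linalg.eigh(c)
        return (vecs * vals ** -0.5) @ vecs.T

    # Singular values of the whitened cross-covariance are the canonical correlations.
    whitened = inv_sqrt(cxx) @ cxy @ inv_sqrt(cyy)
    return float(min(1.0, np.linalg.svd(whitened, compute_uv=False)[0]))


def rdc(x, y, k=10, s=1.0, seed=0):
    """Randomized dependency coefficient between two univariate samples."""
    rng = np.random.default_rng(seed)
    fx = random_sine_features(empirical_copula(x), k, s, rng)
    fy = random_sine_features(empirical_copula(y), k, s, rng)
    return largest_canonical_correlation(fx, fy)


data_rng = np.random.default_rng(42)
x = data_rng.normal(size=1000)
dependent = rdc(x, x ** 2 + 0.1 * data_rng.normal(size=1000))  # non-linear dependence
independent = rdc(x, data_rng.normal(size=1000))               # no dependence
```

On finite samples the coefficient for independent variables is small but not exactly zero, which is why a threshold on the dependency graph is used rather than a test for equality with zero.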

Rényi Conditioning: The task of clustering hybrid data samples depends on the choice of the metric space, which in turn typically depends on the parametric assumptions made for each variable. Consider, e.g., the popular choice of K-Means using the Euclidean metric. It makes a Gaussian assumption and therefore is not principled for categorical data. To eliminate the reliance on knowing the type, we propose to cluster multivariate hybrid samples after they have been processed by the RDC pipeline. Not only does the series of non-linear transformations produce a feature space in which clusters may be more easily separable, but no distributional assumptions are required. More formally, given a set of samples $\mathcal{D}$ over RVs $\mathbf{V}$, we split it into a sample partitioning $\{\mathcal{D}_c\}_{c=1}^{C}$ such that $\bigcup_{c} \mathcal{D}_c = \mathcal{D}$ and $\mathcal{D}_c \cap \mathcal{D}_{c'} = \emptyset$ for $c \neq c'$. The weights for the convex combination on the sum nodes are estimated as the proportions of the data belonging to each cluster, i.e., $w_c = |\mathcal{D}_c| / |\mathcal{D}|$. The procedure is sketched in Alg. 2. First, we transform every feature in $\mathcal{D}$ using Eq. 2. Then, all our features are projected into a new $k$-dimensional non-linear space. In this new space we can now safely apply K-Means to obtain the $C$ clusters. In Alg. 2, we set $C = 2$, as this generally leads to deeper networks Vergari et al. (2015).
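The conditioning step can be illustrated with a small, dependency-free sketch: rows are copula-transformed column by column, mapped to random sine features, and split by a plain two-cluster Lloyd iteration, with the sum-node weights taken as cluster proportions. All names, the defaults `k=8` and `s=1.0`, the joint projection of all columns, and the deterministic initialisation are assumptions of this example. Categorical columns would first need the one-hot treatment and are not handled here.

```python
import bisect
import math
import random


def copula_column(col):
    """Empirical CDF value of every entry of a numeric column."""
    ordered = sorted(col)
    return [bisect.bisect_right(ordered, v) / len(col) for v in col]


def rdc_features(rows, k=8, s=1.0, seed=0):
    """Copula-transform each column, then map each row to k random sine features."""
    rng = random.Random(seed)
    cols = [copula_column(c) for c in zip(*rows)]
    w = [[rng.gauss(0.0, math.sqrt(s)) for _ in cols] for _ in range(k)]
    b = [rng.gauss(0.0, math.sqrt(s)) for _ in range(k)]
    return [
        [math.sin(sum(wi * ui for wi, ui in zip(w[j], u)) + b[j]) for j in range(k)]
        for u in zip(*cols)
    ]


def sqdist(p, q):
    return sum((a - c) ** 2 for a, c in zip(p, q))


def two_means(points, iters=25):
    """Lloyd iterations for two clusters, initialised with the first and farthest point."""
    centers = [points[0], max(points, key=lambda p: sqdist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sqdist(p, centers[0]) <= sqdist(p, centers[1]) else 1 for p in points]
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster happens to be empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def renyi_conditioning(rows):
    """Split the rows into two clusters and return them with the sum-node weights."""
    labels = two_means(rdc_features(rows))
    clusters = [[r for r, lab in zip(rows, labels) if lab == c] for c in (0, 1)]
    weights = [len(cluster) / len(rows) for cluster in clusters]
    return clusters, weights


rows = [[i * 0.1, 5.0 - i * 0.05] for i in range(20)]
rows += [[10.0 + i * 0.1, 50.0 + i] for i in range(20)]
clusters, weights = renyi_conditioning(rows)
```

The learner would then recurse on each element of `clusters`, attaching the resulting sub-networks to a sum node with the returned `weights`.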

Nonparametric Univariate Leaf Distributions: Finally, to be fully type agnostic, i.e., to realize MSPNs, we adopt piecewise polynomial approximations of the univariate leaf densities. The simplest and most straightforward approximation we consider is a piecewise constant function, i.e., a histogram. More precisely, we adopt the scheme proposed in Rozenholc et al. (2010), which offers an adaptive binning, i.e., with irregular intervals, that is learned from data by optimizing a penalized likelihood function. This allows MSPNs to model both multimodal and skewed univariate distributions without further assumptions. We apply Laplace smoothing with a fixed factor to cope with empty bins and unseen values in the distribution domain.
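A smoothed histogram leaf can be sketched as below. The class name `HistogramLeaf` and the smoothing value of 1.0 are assumptions, and the bin edges are supplied by the caller: the penalized-likelihood selection of irregular bins from Rozenholc et al. (2010) is not reproduced here.

```python
import bisect


class HistogramLeaf:
    """Piecewise constant density over given (possibly irregular) bin edges."""

    def __init__(self, data, edges, smoothing=1.0):
        self.edges = list(edges)
        n_bins = len(self.edges) - 1
        counts = [0] * n_bins
        for x in data:
            counts[self._bin(x)] += 1
        # Laplace smoothing: every bin receives `smoothing` pseudo-counts.
        total = len(data) + smoothing * n_bins
        self.widths = [self.edges[i + 1] - self.edges[i] for i in range(n_bins)]
        self.heights = [
            (count + smoothing) / (total * width)
            for count, width in zip(counts, self.widths)
        ]

    def _bin(self, x):
        # Index of the bin containing x, clamped so the last edge is included.
        i = bisect.bisect_right(self.edges, x) - 1
        return min(max(i, 0), len(self.edges) - 2)

    def density(self, x):
        if x < self.edges[0] or x > self.edges[-1]:
            return 0.0
        return self.heights[self._bin(x)]


leaf = HistogramLeaf([0.1, 0.2, 0.25, 2.7], edges=[0.0, 0.5, 1.0, 3.0])
total_mass = sum(h * w for h, w in zip(leaf.heights, leaf.widths))
empty_bin_density = leaf.density(0.7)  # no sample falls into [0.5, 1.0)
```

Thanks to the pseudo-counts, the bin without observations still receives non-zero density, while the total mass remains one.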

Indeed, by increasing the degree of the leaf polynomial approximations, one can favor more expressive models. To balance expressiveness against the complexity of learning and inference, however, we restrict ourselves to models up to piecewise linear approximations.

We reframe the unsupervised task of estimating the density of univariate leaf distributions into a supervised one by fitting a nonparametric unimodal distribution function through isotonic regression Frisen (1986), referred to as iso. Once we have collected a set of pairs of points, e.g., from the previously estimated histogram, we employ them as labeled observations to fit a monotonically increasing (resp. decreasing) piecewise linear function up to (resp. down from) the estimated distribution mode. Note that the unimodality assumption for leaves is realistic, since we can adapt LearnMSPN, cf. Alg. 3, to grow a leaf only after no more clustering steps are possible, i.e., when it is difficult if not impossible to separate two modalities in the observed data.
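One way to picture the unimodal fit is with the classical Pool Adjacent Violators algorithm, applied once to the points left of the mode and once, mirrored, to the points right of it. The sketch below is an assumption about how such a fit could be implemented, not the authors' procedure: the function names are hypothetical, the mode is simply taken as the highest input point, and the resulting curve is not renormalized to integrate to one.

```python
import bisect


def pava(values):
    """Pool Adjacent Violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block mean exceeds the current block mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit


def unimodal_fit(heights):
    """Non-decreasing up to the highest point, non-increasing afterwards."""
    mode = max(range(len(heights)), key=heights.__getitem__)
    rising = pava(heights[: mode + 1])
    falling = pava(heights[mode:][::-1])[::-1]
    return rising + falling[1:]


def piecewise_linear(xs, ys, x):
    """Linear interpolation between knots (xs must be strictly increasing)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


# Hypothetical bin centers and histogram heights used as labeled observations.
centers = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
heights = [1.0, 3.0, 2.0, 5.0, 4.0, 4.5, 1.0]
fitted = unimodal_fit(heights)
value_between_knots = piecewise_linear(centers, fitted, 2.5)
```

The two local bumps in `heights` are pooled away, leaving a single mode at the fourth knot.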

Now we have everything together to evaluate MSPNs empirically. Before doing so, we would like to stress that we focused here on a general setting. Instead of piecewise linear leaves, one can also employ existing hybrid densities as leaf distributions, such as HBNs, mixtures of truncated exponential families, or other nonparametric density estimators such as Kernel Density Estimators (KDEs) and even denoising and variational autoencoders.

## 4 Experimental Evaluation

Our intention is to investigate the benefits of MSPNs compared to other mixed probabilistic models in terms of accuracy and flexibility of inference. Specifically, we investigate the following questions: (Q1) Is the MSPN distribution flexible for hybrid domains? (Q2) How do MSPNs compare to existing mixed models? (Q3) How do MSPNs compare to state-of-the-art parametric models in a single-type domain? (Q4) Can MSPNs be applied across a wide range of distributions and inference tasks? (Q5) Are there benefits of having tractable inference for hybrid domains, even via symbolic computation?

To this aim, we implemented MSPNs in Python calling R. All experiments ran on a Linux machine with 56 cores, 4 GPUs and 512GB RAM.

dataset | HBN | MSPN (Gower, hist) | MSPN (Gower, iso) | MSPN (RDC, hist) | MSPN (RDC, iso) |
---|---|---|---|---|---|
anneal-U | -42.647 | -63.553 | -38.836 | -60.314 | -38.312 |
australian | -38.423 | -18.513 | -30.379 | -17.891 | -31.021 |
auto | -71.530 | -72.998 | -69.405 | -73.378 | -70.066 |
balance-scale | -7.483 | -8.038 | -7.045 | -7.932 | -7.302 |
breast | -30.572 | -34.027 | -23.521 | -34.272 | -24.035 |
breast-cancer | -9.193 | -15.373 | -9.500 | -16.277 | -9.990 |
cars | -28.596 | -30.467 | -31.082 | -29.132 | -30.516 |
cleave | -26.296 | -26.132 | -25.869 | -25.707 | -25.441 |
crx | -34.563 | -22.422 | -31.624 | -24.036 | -31.727 |
diabetes | -29.797 | -15.286 | -26.968 | -15.930 | -27.242 |
german | -34.356 | -40.828 | -33.480 | -38.829 | -32.361 |
german-org | -29.051 | -43.611 | -26.852 | -37.450 | -27.294 |
heart | -28.519 | -20.691 | -26.994 | -20.376 | -25.906 |
iris | -1.670 | -3.616 | -2.892 | -3.446 | -2.843 |
wins over HBN | - | 4/14 | 11/14 | 4/14 | 11/14 |
wins | 3/14 | 11/14 (MSPN) | | | |

Hybrid UCI Benchmarks (Q1, Q2): We considered the 14 preprocessed UCI benchmarks from the MLC++ library (https://www.sgi.com/tech/mlc/download.html) listed in Table 1. The domains span from survey data to medical and biological domains, and they contain continuous, discrete and categorical variables in different proportions.

As a baseline density estimator we considered HBNs whose conditional dependencies are modeled as conditional linear Gaussians Heckerman and Geiger (1995). To learn their structure we explored both score-based and constraint-based approaches, finding the Max-Min Hill Climbing (MMHC) algorithm Tsamardinos et al. (2006) to perform the best on the holdout data. For weight learning, we optimized the BDeu score. As an additional sanity check of our nonparametric RDC pipeline, we also trained MSPNs employing K-Medoids using the Gower distance (GowerMSPNs). The Gower distance Gower and Gower (1971) defines a metric over hybrid domains, at the cost of making distributional assumptions for each variable involved: it takes the average of the per-feature distances. We assumed continuous variables to be Gaussian and discrete ones to be binomial.
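For intuition, the common textbook form of the Gower distance averages a range-normalized absolute difference for numeric features and a 0/1 mismatch for categorical ones. The sketch below shows that form only; it is not necessarily the exact variant used for GowerMSPNs, and the function name and the `ranges` argument are hypothetical.

```python
def gower_distance(a, b, ranges):
    """Average per-feature distance between two mixed-type records.

    ranges[i] is (max - min) of numeric feature i, or None for a categorical feature.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        elif r > 0:
            total += abs(x - y) / r          # numeric: range-normalized difference
    return total / len(ranges)


first = [1.0, "a", 10.0]
second = [3.0, "b", 10.0]
distance = gower_distance(first, second, ranges=[4.0, None, 20.0])
```

A K-Medoids clustering would then operate on the matrix of such pairwise distances.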

The results are summarized in Table 1. MSPNs clearly outperform HBNs. Moreover, the performance of Rényi conditioning is comparable to GowerMSPNs. This shows that using RDC is a sensible idea and frees the user from making parametric assumptions. Using histogram representations allows one to capture mixtures, which turns out to be beneficial for some datasets, but also results in a higher variance in performance across datasets, giving a benefit to isotonic regression. This answers (Q1, Q2) affirmatively.

Learning Simplex Distributions (Q3): We considered data common in text and chemistry domains: proportional data, i.e., data lying on the probability simplex, whose values are in $[0, 1]$ and sum up to $1$. The Dirichlet distribution is arguably the most popular parametric distribution for this type of data. Hence, we used it as baseline.
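As a reminder of what the baseline computes, the log-density of a Dirichlet distribution at a point on the simplex can be written in a few lines. The function name and the example parameters are illustrative, and fitting the concentration parameters to data is not shown.

```python
import math


def dirichlet_logpdf(x, alpha):
    """Log-density of Dirichlet(alpha) at a point x in the interior of the simplex."""
    if abs(sum(x) - 1.0) > 1e-9 or any(v <= 0.0 for v in x):
        raise ValueError("x must have positive entries that sum to one")
    # Log of the normalizing constant Gamma(sum alpha) / prod Gamma(alpha_i).
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(v) for a, v in zip(alpha, x))


# With all concentration parameters equal to one, the density is uniform on the simplex.
uniform_value = dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
```

Averaging such values over held-out points gives a mean log-likelihood of the kind reported for the baseline.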

First, we considered the NIPS corpus, containing 1,500 documents over the 100 most frequent words. We ran Latent Dirichlet Allocation (LDA) Blei (2012) with different numbers of topics (3, 5, 10, 20, 50) and used the topics found as the data for our experiments. Fig. 2 shows that the MSPN fits the density well and actually fits the lower-left topic better.

Dimension | Dirichlet | MSPN(RDC,iso) | MSPN(Gower,iso) |
---|---|---|---|
NIPS + LDA | | | |
3 | 2.045 (0.297) | 4.071 (0.66) | 4.333 (0.627) |
5 | 7.311 (0.406) | 10.376 (0.671) | 10.419 (0.711) |
10 | 25.047 (0.787) | 35.927 (1.755) | 34.205 (1.716) |
20 | 69.668 (2.014) | 109.222 (4.179) | 92.981 (4.245) |
50 | 245.008 (3.573) | 338.477 (6.976) | 349.259 (9.916) |
Air Quality + Archetypes | | | |
3 | 2.939 (1.536) | 5.852 (2.261) | 7.114 (2.272) |
5 | 14.625 (4.678) | 16.494 (7.574) | 15.099 (4.888) |
10 | 61.317 (4.81) | 84.124 (6.575) | 85.645 (5.887) |
20 | 174.171 (5.799) | 232.075 (7.74) | 242.482 (10.224) |
Hydrochemicals | | | |
12 | 59.546 (1.781) | 71.013 (3.591) | 82.377 (1.445) |
wins over Dir. | - | 10/10 | 10/10 |
wins | 0/10 | 10/10 (MSPN) | |

Then, we investigated the Air Quality dataset (https://archive.ics.uci.edu/ml/datasets/Air+Quality). We ignored the time features as well as the feature C6H6, which has many missing values, and used only complete instances, resulting in 6,941 measurements for 12 features about air composition. We ran Archetypal Analysis Cutler and Breiman (1994); Thurau et al. (2012) for 3, 5, and 10 archetypes and extracted the archetypal convex reconstructions of the original data.

We also considered the hydro-chemical dataset of Tolosana-Delgado et al. (2005), containing 485 observations of 14 measurements of different chemicals for the Llobregat river in Spain. The relative concentrations are used to fit MSPNs and the Dirichlet distributions. The 10-fold cross-validated mean log-likelihoods for all models on the three datasets are summarized in Table 2. As one can see, in all cases MSPNs can capture the distribution on the simplex better than the Dirichlet. This is to be expected as MSPNs can capture more complex (in)dependencies, whereas the Dirichlet makes stronger independence assumptions. All simplex experiments together answer (Q3) affirmatively.

Leveraging symbolic-semantic information (Q4): Symbol grounding, the problem of how symbols get their meanings, is at the heart of AI, and we explored MSPNs as a step towards tackling this classical AI problem. More precisely, we considered the $28 \times 28$ MNIST gray digit images. We represented each digit as 16 continuous features extracted from an autoencoder (AE) trained on the MNIST training split: we stacked two layers of 256 and 128 rectifier neurons for both the encoder and the decoder and trained them for 200 epochs using Adam as optimizer, with a fixed learning rate and no learning rate decay. To create a hybrid dataset, we then augmented MNIST with symbolic semantic information encoded as binary codes. Each bit of the code is 1 if a digit contains one of the following visual features: (i) a vertical stroke (true for 1, 4 and 7), (ii) a circle (0, 6, 8 and 9), (iii) a left curvy stroke (2, 3, 5, 8 and 9), (iv) a right curvy stroke (5 and 6), (v) a horizontal stroke (7, 2, 3, 4, and 5), (vi) a double curve stroke (3 and 8). That is, each class is encoded by a 6-bit code. For instance, images representing a “3” are assigned the code 001011, while 001110 corresponds to “5”. Additionally, we considered the original class variable as a third piece of information. Let $\mathbf{X}$ denote the continuous embedding variables, $\mathbf{B}$ the additional 6 binary symbolic features, and $C$ the categorical class variable.
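The 6-bit visual codes follow mechanically from the feature list above, as this small sketch shows. The names `VISUAL_FEATURES` and `visual_code` are introduced here only for illustration.

```python
# Visual features in the order (i) to (vi), each with the digits that exhibit it.
VISUAL_FEATURES = [
    ("vertical stroke", {1, 4, 7}),
    ("circle", {0, 6, 8, 9}),
    ("left curvy stroke", {2, 3, 5, 8, 9}),
    ("right curvy stroke", {5, 6}),
    ("horizontal stroke", {7, 2, 3, 4, 5}),
    ("double curve stroke", {3, 8}),
]


def visual_code(digit):
    """Binary code with one bit per visual feature, in the listed order."""
    return "".join("1" if digit in digits else "0" for _, digits in VISUAL_FEATURES)


codes = {digit: visual_code(digit) for digit in range(10)}
code_of_three = codes[3]
code_of_five = codes[5]
```

Each training image is then augmented with the six bits of the code of its class.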

In a first experiment, we trained an MSPN on a subsample of 10,000 instances of the augmented MNIST training data to model the joint distribution over the continuous embeddings, the visual codes and the class. Then, we evaluated on the augmented MNIST test split whether the learned MSPN had captured the non-explicit dependencies between the three different feature domains. First, we predicted the continuous embedding for the visual code belonging to each class. Fig. 3 (a) visualizes the prediction as decoded by the autoencoder back in pixel space. As one can see, the MSPN is not only able to recover the correct class but also does not simply memorize a training sample. An additional visual proof is provided by conditional sampling: after propagating the evidence for an observed code bottom-up, we sample a configuration of the embedding variables (applying Vergari et al.’s (2016) top-down approach). Decoded samples clearly belong to the class of the observed code, cf. Fig. 3 (a). Then, to evaluate how well the MSPN was able to glue the continuous and binary domains together, we performed conditional sampling starting from unseen visual codes. For instance, for a code that merges the visual codes of “3” and “5”, we expect a digit in between a “3” and a “5”. Fig. 3 (b) confirms this: decoded samples belong to either class or are closely “in between” them. Similarly, Fig. 3 (c) shows samples conditioned on a code in between classes 5 and 1.

Next, we investigated how much symbol groundings can help density estimation and classification. On the MNIST test split, we investigated the benefit of using visual codes of length 2, 4, 8, 16, 32 and 64. We measured the improvement in the marginal likelihood of the continuous embeddings and in the classification accuracy of an MSPN trained on embeddings, visual codes and class over an MSPN trained only on embeddings and class; for both measures, the improvement is computed with respect to the latter model. The results are summarized in Fig. 4. As one can see, increasing the number of symbolic features improves both the marginal likelihood over the embeddings and the classification performance. Note that for computing this marginal likelihood and for predicting the class, one has to marginalize over the symbolic features, which cannot be done efficiently using classical mixed graphical models.

Finally, we employed MSPNs for MNIST reconstruction. We split the original images into halves, left (l) and right (r) as well as up (u) and down (d), and encoded each half into 16 continuous features by learning one autoencoder independently for each of them. Note that each of the four resulting variable sets (left, right, up and down) forms a domain with a different distribution. Then, we learned one MSPN over the left and right halves and one over the up and down halves. We performed MPE inference to predict one half of a test image given the complementary one, e.g., left from right. Predicted samples are shown in Fig. 3 (d). As one can see, the reconstructions are indeed very plausible. This suggests that MSPNs are a valuable tool to effectively learn distributions and make predictions across different domains. All the experiments on leveraging symbolic-semantic information together answer (Q4) affirmatively.

Mixed Mutual Information (Q5): Recall that an MSPN encodes a polynomial over leaf piecewise polynomials. Consequently, one can employ a symbolic solver to evaluate the overall network polynomial and thereby easily compute information-theoretic measures that would be difficult to compute otherwise, in particular for hybrid domains. To illustrate this for MSPNs, we consider computing mutual information (MI) in hybrid domains. MI also provides a way to extract the gist of MSPNs, as it highlights relevant variable associations only. Fig. 5 shows the MI network induced over the Autism Dataset Deserno et al. (2016), which reflects natural semantic connections. This not only answers (Q5) affirmatively but also indicates that MSPNs may pave the way to automated mixed statisticians: the MI together with the tree structure of MSPNs can automatically be compiled into textual descriptions of the model, an interesting avenue for future work. To summarize our experimental results as a whole, all questions (Q1)-(Q5) can be answered affirmatively.

## 5 Conclusions

We introduced Mixed Sum-Product Networks (MSPNs), a novel combination of nonparametric probability distributions and deep probabilistic models. In contrast to classical shallow mixed graphical models, they provide effective learning, a range of tractable inferences and enhanced interpretability. Our experiments demonstrate that MSPNs are competitive with parameterized distributions as well as mixed graphical models and make queries that were previously difficult, if not impossible, to compute easy. Hence, they allow users to train multivariate mixed distributions more easily than previous approaches across a wide range of domains.

MSPNs suggest several avenues for future work: learning boosted MSPNs and mixtures of MSPNs, exploring other nonparametric leaves such as KDEs, other mixed graphical models, and variational autoencoders, extending them to other instances of arithmetic circuits Choi and Darwiche (2017), and making use of weighted model integration solvers for capturing more complex types of queries Belle et al. (2015a); Morettin et al. (2017). Probably the most interesting avenue is to turn MSPNs into automated statisticians, able to predict the statistical type of a variable (is it continuous or ordinal?) and ultimately its parametric form (is it Gaussian or Poisson?) Valera and Ghahramani (2017).

Acknowledgements: This work is motivated and partly supported by the BMEL/BLE project DePhenSe, FKZ 313-06.01-28-1-82.047-15. AM has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", project B4. SN has been supported by the CwC Program Contract W911NF-15-1-0461 with the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO).

## References

- Bekker et al. (2015) Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche, and Guy Van den Broeck. Tractable learning for complex probability queries. In Proc. of NIPS, 2015.
- Belle et al. (2015a) Vaishak Belle, Andrea Passerini, and Guy Van den Broeck. Probabilistic inference in hybrid domains by weighted model integration. In Proc. of IJCAI, pages 2770–2776, 2015a.
- Belle et al. (2015b) Vaishak Belle, Guy Van den Broeck, and Andrea Passerini. Hashing-based approximate probabilistic inference in hybrid domains. In Proc. of UAI, pages 141–150, 2015b.
- Blei (2012) David M Blei. Probabilistic topic models. CACM, 55(4):77–84, 2012.
- Cheng et al. (2014) Wei-Chen Cheng, Stanley Kok, Hoai Vu Pham, Hai Leong Chieu, and Kian Ming Adam Chai. Language modeling with Sum-Product Networks. In Proc. of Interspeech, 2014.
- Choi and Darwiche (2017) Arthur Choi and Adnan Darwiche. On relaxing determinism in arithmetic circuits. In Proc. of ICML, pages 825–833, 2017.
- Cutler and Breiman (1994) Adele Cutler and Leo Breiman. Archetypal analysis. Technometrics, 36(4):338–347, 1994.
- Deserno et al. (2016) Marie K Deserno, Denny Borsboom, Sander Begeer, and Hilde M Geurts. Multicausal systems ask for multicausal approaches: A network perspective on subjective well-being in individuals with autism spectrum disorder. Autism, page 1362361316660309, 2016.
- Elidan (2010) Gal Elidan. Copula bayesian networks. In Proc. of NIPS, 2010.
- Frisen (1986) M. Frisen. Unimodal regression. Journal of the Royal Statistical Society. Series D, 35(4):479–485, 1986. ISSN 00390526, 14679884. URL http://www.jstor.org/stable/2987804.
- Gebelein (1941) H. Gebelein. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364–379, 1941.
- Gens and Domingos (2013) Robert Gens and Pedro Domingos. Learning the Structure of Sum-Product Networks. In Proc. of ICML, 2013.
- Gower and Gower (1971) J. C. Gower. A general coefficient of similarity and some of its properties. Biometrics, 1971.
- Haslbeck and Waldorp (2015) J. M. B. Haslbeck and L. J. Waldorp. mgm: Estimating time-varying mixed graphical models in high-dimensional data. ArXiv, 1510.06871, 2015.
- Heckerman and Geiger (1995) David Heckerman and Dan Geiger. Learning bayesian networks: a unification for discrete and gaussian domains. In Proc. of UAI, 1995.
- Langseth et al. (2012) Helge Langseth, Thomas D Nielsen, Rafael Rumí, and Antonio Salmerón. Mixtures of truncated basis functions. International Journal of Approximate Reasoning, 53(2):212–227, 2012.
- Lauritzen and Wermuth (1989) S.L. Lauritzen and N. Wermuth. Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics, 17(1):31–57, 1989.
- Lopez-Paz et al. (2013) David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. The randomized dependence coefficient. In Proc. of NIPS, 2013.
- Lowd and Domingos (2008) Daniel Lowd and Pedro Domingos. Learning arithmetic circuits. In Proc. of UAI, 2008.
- Molina et al. (2017) Alejandro Molina, Sriraam Natarajan, and Kristian Kersting. Poisson sum-product networks: A deep architecture for tractable multivariate poisson distributions. In Proc. of AAAI, 2017. URL http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14530.
- Moral et al. (2001) Serafín Moral, Rafael Rumí, and Antonio Salmerón. Mixtures of truncated exponentials in hybrid bayesian networks. In Salem Benferhat and Philippe Besnard, editors, Proc. of ECSQARU, 2001.
- Morettin et al. (2017) Paolo Morettin, Andrea Passerini, and Roberto Sebastiani. Efficient weighted model integration via smt-based predicate abstraction. In Proc. of IJCAI, 2017.
- Peharz et al. (2015) Robert Peharz, Sebastian Tschiatschek, Franz Pernkopf, and Pedro Domingos. On theoretical properties of sum-product networks. In Proc. of AISTATS, 2015.
- Póczos et al. (2012) Barnabás Póczos, Zoubin Ghahramani, and Jeff G. Schneider. Copula-based kernel dependency measures. arXiv 1206.4682, 2012.
- Poon and Domingos (2011) Hoifung Poon and Pedro Domingos. Sum-Product Networks: a New Deep Architecture. In Proc. of UAI, 2011.
- Rahimi and Recht (2009) Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Proc. of NIPS, 2009.
- Rényi (1959) Alfréd Rényi. On measures of dependence. Acta mathematica hungarica, 10(3-4):441–451, 1959.
- Rooshenas and Lowd (2014) Amirmohammad Rooshenas and Daniel Lowd. Learning sum-product networks with direct and indirect variable interactions. In Proc. of ICML, pages 710–718, 2014.
- Rozenholc et al. (2010) Yves Rozenholc, Thoralf Mildenberger, and Ursula Gather. Combining regular and irregular histograms by penalized likelihood. Comp. Statistics & Data Analysis, 54(12):3313–3323, 2010.
- Sanner and Abbasnejad (2012) S. Sanner and E. Abbasnejad. Symbolic variable elimination for discrete and continuous graphical models. In Proc. of AAAI, 2012.
- Shenoy and West (2011) P.P. Shenoy and J.C. West. Inference in hybrid bayesian networks using mixtures of polynomials. International Journal of Approximate Reasoning, 52(5):641–657, 2011.
- Thurau et al. (2012) C. Thurau, K. Kersting, M. Wahabzada, and C. Bauckhage. Descriptive matrix factorization for sustainability adopting the principle of opposites. DAMI, 24(2):325–354, 2012.
- Tolosana-Delgado et al. (2005) R Tolosana-Delgado, N Otero, V Pawlowsky-Glahn, and A Soler. Latent compositional factors in the llobregat river basin (spain) hydrogeochemistry. Math. Geology, 37(7):681–702, 2005.
- Tsamardinos et al. (2006) Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing bayesian network structure learning algorithm. MLJ, 65(1):31–78, 2006.
- Valera and Ghahramani (2017) Isabel Valera and Zoubin Ghahramani. Automatic discovery of the statistical types of variables in a dataset. In Proc. of ICML, 2017. URL http://proceedings.mlr.press/v70/valera17a.html.
- Vergari et al. (2015) Antonio Vergari, Nicola Di Mauro, and Floriana Esposito. Simplifying, Regularizing and Strengthening Sum-Product Network Structure Learning. In Proc. of ECML-PKDD, 2015.
- Vergari et al. (2016) Antonio Vergari, Nicola Di Mauro, and Floriana Esposito. Visualizing and understanding sum-product networks. arXiv 1608.08266, 2016. URL https://arxiv.org/abs/1608.08266.
- Yang et al. (2014) E. Yang, Y. Baker, P. Ravikumar, G.I. Allen, and Z. Liu. Mixed graphical models via exponential families. In Proc. of AISTATS, 2014.
- Zhao et al. (2015) Han Zhao, Mazen Melibari, and Pascal Poupart. On the Relationship between Sum-Product Networks and Bayesian Networks. In Proc. of ICML, 2015.