Probabilistic Deep Learning using Random Sum-Product Networks
Abstract
Probabilistic deep learning is currently receiving increased interest, as consistent treatment of uncertainty is one of the most important goals in machine learning and AI. Most current approaches, however, have severe limitations concerning inference. Sum-Product networks (SPNs), although having excellent properties in that regard, have so far not been explored as serious deep learning models, likely due to their special structural requirements. In this paper, we make a drastic simplification and use a random structure which is trained in a "classical deep learning manner", i.e., with automatic differentiation, SGD, and GPU support. The resulting models, called RAT-SPNs, yield prediction results comparable to deep neural networks, but maintain well-calibrated uncertainty estimates, which makes them highly robust against missing data. Furthermore, they successfully capture uncertainty over their inputs in a convincing manner, yielding robust outlier and peculiarity detection.
Robert Peharz, Dept. of Engineering, University of Cambridge
Martin Trapp, Austrian Research Institute for Artificial Intelligence
Antonio Vergari, Max Planck Institute for Intelligent Systems
Kristian Kersting, Computer Science Dept., TU Darmstadt
Karl Stelzner, Computer Science Dept., TU Darmstadt
Zoubin Ghahramani, Dept. of Engineering, University of Cambridge
Alejandro Molina, Computer Science Dept., TU Darmstadt
1 Introduction
Dealing with uncertainty is clearly one of the most important aspects of machine learning and AI. An intelligent system should be able to handle uncertain inputs (e.g. missing features) as well as express its uncertainty over outputs. The latter, especially, is crucial in decision-making processes such as medical diagnosis and planning systems for autonomous agents. It is therefore no surprise that probabilistic approaches have recently gained tremendous momentum also in deep learning, currently the predominant branch of machine learning. Examples of probabilistic deep learning systems are variational autoencoders (VAEs) [14], deep generative models [24], generative adversarial nets (GANs) [11], neural autoregressive density estimators (NADEs) [15], and PixelCNNs/RNNs [30].
However, most of these probabilistic deep learning approaches have limited capabilities when it comes to inference. Implicit likelihood models like GANs, even when successful in capturing the data distribution, do not allow one to evaluate the probability of a test sample. Similar problems arise in deep generative models and VAEs, which typically use an inference network to infer the posterior over a latent variable space. However, inference in both these models and—ironically—also their inference networks is limited to drawing samples, which forces users to retreat to Monte Carlo estimates. NADEs and PixelCNNs/RNNs, both instances of autoregressive density estimators, allow one to efficiently evaluate sample likelihoods and even certain marginalization and conditioning tasks, provided that the marginalized/conditioned variables appear first in the assumed variable ordering. Otherwise, inference is rendered intractable. Uria et al. [29] address this problem by training an ensemble of NADEs with shared network structure. This approach, however, introduces the delicate problem of approximately training a super-exponential ensemble of NADEs.
Sum-Product Networks (SPNs) are a class of probabilistic models with a crucial advantage over the models above [22]: they permit exact and efficient inference. More precisely, they are able to compute any marginalization and conditioning query in time linear in the model's representation size. However, although SPNs can be described in a nutshell as "deep mixture models" [19], they have received rather limited attention in the deep learning literature, despite their attractive inference properties. We identify three reasons for this situation. First, the structure of SPNs needs to obey certain constraints, requiring either careful structure design by hand or learning the structure from data [6, 9, 18, 25, 31, 1, 28]. Second, the parameter learning schemes proposed so far are either inspired by graphical models [22, 34, 19] or are tailored to SPNs [8]. These peculiarities concerning structure and parameter learning have probably hindered a wide application of SPNs in the connectionist approach so far. Third, there seems to be a folklore that SPNs are "somewhat weak function approximators", i.e., it is widely believed that SPNs are less suitable for solving prediction tasks to the extent we expect from deep neural networks. However, this belief is not theoretically grounded. SPNs inherit universal approximation properties from mixture models — as a mixture model is simply a "shallow" SPN with a single sum node. Consequently, SPNs should in theory also be able to represent any prediction function via probabilistic inference.
In this paper, we empirically demystify this folklore and investigate the fitness of SPNs as deep learning models. To this aim, we introduce a novel and particularly simple way to construct SPNs, obviating the need for structure learning. Our SPNs are obtained by first constructing a random region graph [6, 18] laying out the overall network design. Subsequently, the region graph is populated with tensors of SPN nodes, which allows an easy mapping onto deep learning frameworks such as TensorFlow [7]. Consequently, our models—called Random Tensorized SPNs (RAT-SPNs)—can be optimized in an end-to-end fashion, using standard deep learning techniques such as automatic differentiation, adaptive SGD optimizers, and automatic GPU parallelization. To avoid overfitting, we adopt the well-known dropout heuristic [26], which yields an elegant probabilistic interpretation as marginalization of missing features (dropout at inputs) and as injection of discrete noise (dropout at sum nodes). We trained RAT-SPNs on several real-world classification data sets, showing that their prediction performance is comparable to traditional deep neural networks. At the same time, RAT-SPNs specify a complete distribution over both inputs and outputs, which allows us to treat uncertainty in a consistent and efficient manner. First, we show that RAT-SPNs are dramatically more robust against missing features than neural networks. Second, we show that RAT-SPNs also provide well-calibrated uncertainty estimates over their inputs, i.e., the model "knows what it does not know", which can be exploited for anomaly and out-of-domain detection.
2 Related Work
We denote random variables (RVs) by uppercase letters, e.g. $X$, $Y$, and their values by corresponding lowercase letters, e.g. $x$, $y$. Similarly, we denote sets of RVs as $\mathbf{X}$, $\mathbf{Y}$, and their combined values as $\mathbf{x}$, $\mathbf{y}$.
An SPN over $\mathbf{X}$ is a probabilistic model defined via a directed acyclic graph (DAG) containing three types of nodes: input distributions, sums, and products. All leaves of the SPN are distribution functions over some subset $\mathbf{Y} \subseteq \mathbf{X}$. When we know that a node $N$ is a leaf, we also use the explicit symbol $D$. Inner nodes are either weighted sums or products, denoted as $S$ and $P$, respectively, i.e., $S = \sum_{N \in \mathbf{ch}(S)} w_{S,N}\, N$ and $P = \prod_{N \in \mathbf{ch}(P)} N$, where $\mathbf{ch}(N)$ denotes the children of $N$. The sum weights $w_{S,N}$ are assumed to be non-negative and normalized, i.e., $w_{S,N} \geq 0$, $\sum_{N} w_{S,N} = 1$.
The scope of an input distribution $D$ is defined as the set of RVs $\mathbf{Y}$ over which $D$ is defined: $\mathbf{sc}(D) = \mathbf{Y}$. The scope of an inner node $N$ is recursively defined as $\mathbf{sc}(N) = \bigcup_{N' \in \mathbf{ch}(N)} \mathbf{sc}(N')$. To allow efficient inference, SPNs are required to fulfill two structure constraints [5, 22], namely completeness and decomposability. An SPN is complete if for each sum $S$ it holds that $\mathbf{sc}(N') = \mathbf{sc}(N'')$, for each $N', N'' \in \mathbf{ch}(S)$. An SPN is decomposable if for each product $P$ it holds that $\mathbf{sc}(N') \cap \mathbf{sc}(N'') = \emptyset$, for each $N' \neq N'' \in \mathbf{ch}(P)$. In that way, all nodes in an SPN recursively define a distribution over their respective scopes: the leaves are distributions by definition, sum nodes are mixtures of their child distributions, and products are factorized distributions, i.e., they assume independence among the scopes of their children.
Besides representing probability distributions, the crucial advantage of SPNs is that they allow efficient inference: in particular, any marginalization task reduces to the corresponding marginalizations at the leaves (each leaf marginalizing only over its scope), followed by recursively evaluating the internal nodes in a bottom-up pass [21]. Thus, marginalization in SPNs follows essentially the same procedure as evaluating the likelihood of a sample—both scale linearly in the SPN's representation size (assuming tractable marginalization at the leaves). Conditioning can be tackled in a similar manner.
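To make these inference properties concrete, the following is a minimal Python sketch (not the authors' code) of a complete and decomposable SPN over two RVs; marginalizing a variable amounts to setting its leaf values to 1, and the evaluation still takes a single bottom-up pass.

```python
# Minimal sketch (not the authors' code): a complete and decomposable SPN over
# two RVs X1, X2. Marginalizing an RV amounts to setting its leaves to 1.
from scipy.stats import norm

def leaf(x, mean):
    # Univariate Gaussian leaf; a missing value (None) is marginalized to 1.
    return 1.0 if x is None else norm.pdf(x, loc=mean, scale=1.0)

def spn(x1, x2):
    # Two product nodes, each factorizing over the disjoint scopes {X1} and {X2}
    # (decomposability).
    p1 = leaf(x1, mean=-1.0) * leaf(x2, mean=-1.0)
    p2 = leaf(x1, mean=+1.0) * leaf(x2, mean=+1.0)
    # Root sum node mixes children of identical scope {X1, X2} (completeness).
    return 0.3 * p1 + 0.7 * p2

print(spn(0.5, -0.2))   # joint density p(x1, x2)
print(spn(0.5, None))   # marginal p(x1), computed in the same single pass
```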
Learning the parameters of SPNs, i.e., the sum weights and the parameters of the input distributions, can be addressed in various ways. By interpreting the sum nodes as discrete latent variables [22, 35, 19], SPNs can be trained using the classical expectation-maximization (EM) algorithm. "Hard" versions of EM and gradient descent have been proposed in [22, 8]. Gens and Domingos [8], e.g., trained SPNs using a discriminative objective, achieving state-of-the-art classification results on image benchmarks at the time. However, the SPN structure employed there was rather shallow and relied on a rich, hand-crafted feature extraction. Bayesian learning schemes have been proposed in [23, 34]. Zhao et al. [36] derived a concave-convex procedure, which interestingly coincides with the EM updates for sum weights. Subsequently, Trapp et al. [27] introduced a safe semi-supervised learning scheme for discriminative and generative parameter learning, providing guarantees for the performance in the semi-supervised case. Vergari et al. [33] extended SPNs to representation learning, exploiting SPN inference as encoding and decoding routines.
The structure of SPNs can be crafted by hand [22, 20] or learned from data. Most structure learners [25, 31, 1, 16] can be framed as variations of the prototypical top-down scheme LearnSPN due to Gens and Domingos [9]. It recursively splits the data via clustering (to determine sum nodes) and independence tests (for product nodes). The high cost of these repeated splits makes structure learning the bottleneck in training SPNs. In the present paper, we make a drastic simplification by picking a scalable random structure and optimizing its parameters with available deep learning tools.
3 Random Tensorized Sum-Product Networks
To construct random and tensorized SPNs (RAT-SPNs), we use a region graph [22, 6, 18] as an abstract representation of the network structure. Given a set of RVs $\mathbf{X}$, a region $\mathbf{R}$ is defined as any non-empty subset of $\mathbf{X}$. Given any region $\mathbf{R}$, a partition $\mathcal{P}$ of $\mathbf{R}$ is a collection of non-empty, non-overlapping subsets $\mathbf{R}_1, \dots, \mathbf{R}_K$ of $\mathbf{R}$ whose union is again $\mathbf{R}$, i.e., $\mathbf{R}_k \neq \emptyset$, $\mathbf{R}_k \cap \mathbf{R}_l = \emptyset$ for $k \neq l$, and $\bigcup_k \mathbf{R}_k = \mathbf{R}$. Specifically, we here consider only 2-partitions, which will cause all product nodes in our SPNs to have exactly two children. This assumption, often made in the SPN literature, simplifies SPN design and does not impair performance [31].
Now, a region graph $\mathcal{R}$ over $\mathbf{X}$ is a DAG whose nodes are regions and partitions such that the following holds:

1. $\mathbf{X}$ is a region in $\mathcal{R}$ and has no parents (root region). All other regions have at least one parent.
2. All children of regions are partitions and all children of partitions are regions (i.e., $\mathcal{R}$ is bipartite).
3. If $\mathcal{P}$ is a child of $\mathbf{R}$, then $\bigcup_{\mathbf{R}' \in \mathcal{P}} \mathbf{R}' = \mathbf{R}$.
4. If $\mathbf{R}$ is a child of $\mathcal{P}$, then $\mathbf{R} \in \mathcal{P}$.

From this definition it is evident that a region graph dictates a hierarchical partition of the overall scope $\mathbf{X}$. We denote regions which have no child partitions as leaf regions.
Given a region graph, we can construct a corresponding SPN, as illustrated in Alg. 1. Here, each of the $C$ classes is represented by a sum node in the root region, $I$ is the number of input distributions per leaf region, and $S$ is the number of sum nodes in regions which are neither leaf nor root regions. It is easy to verify that this scheme leads to a complete and decomposable SPN.
Within this region-graph SPN framework, we are able to deal both with multi-class classification—each sum node in the root region represents one class conditional, with all classes sharing the SPN structure below—and with density estimation, in which case $C$ is simply 1.
3.1 Random Region Graphs
To construct random region graphs and, in turn, RAT-SPNs, we follow Alg. 2. We randomly divide the root region into two sub-regions of equal size (breaking ties in case of an odd number of RVs) and proceed recursively down to depth $D$, resulting in an SPN of depth $2D$. This recursive splitting mechanism is repeated $R$ times. Fig. 1 shows an SPN for classification built following Alg. 2.
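To illustrate the construction, here is a hedged Python sketch of the recursive random splitting; the function and parameter names (random_region_graph, depth, repetitions) are ours, and the paper's actual Alg. 2 may differ in detail.

```python
# Hedged sketch of the random region graph construction (Alg. 2); names are ours.
import random

def random_region_graph(rvs, depth, repetitions):
    """Region graph as a dict: region (frozenset of RVs) -> list of 2-partitions."""
    root = frozenset(rvs)
    graph = {root: []}

    def split(region, d):
        if d == 0 or len(region) <= 1:
            return  # leaf region: no further partition
        shuffled = random.sample(sorted(region), len(region))
        half = len(shuffled) // 2
        left, right = frozenset(shuffled[:half]), frozenset(shuffled[half:])
        graph.setdefault(region, []).append((left, right))
        for child in (left, right):
            graph.setdefault(child, [])
            split(child, d - 1)

    for _ in range(repetitions):  # repeat the recursive splitting R times
        split(root, depth)
    return graph

region_graph = random_region_graph(range(8), depth=2, repetitions=2)
```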
Moreover, this construction scheme yields (RAT-)SPNs in which input distributions, sums, and products can be naturally organized in alternating layers. Similar to classical multilayer perceptrons (MLPs), each layer takes inputs from its directly preceding layer only. Unlike MLPs, however, layers in RAT-SPNs are connected block-wise sparsely, in a random fashion. Thus, layers in MLPs and RAT-SPNs are hardly comparable; however, we suggest understanding each pair of sum and product layers as roughly corresponding to one layer in an MLP: sum layers play the role of (sparse) matrix multiplications, and product layers act as nonlinearities (or, more precisely, bilinearities of their inputs). Indeed, RAT-SPNs are similar in spirit to the reparametrization of SPNs as MLPs considered by Vergari et al. [32]; however, our construction here combines nodes in blocks and reduces the overall sparseness.
3.2 Training and Implementation
Let $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ be a training set of inputs $\mathbf{x}_n$ and class labels $y_n \in \{1, \dots, C\}$. Furthermore, let $S_c(\mathbf{x})$ denote the $c$-th output of the RAT-SPN (the sum node for class $c$ in the root region) and $\mathbf{w}$ the set of all SPN parameters. We train RAT-SPNs by minimizing the objective
$$\mathcal{O}(\mathbf{w}) = \lambda \, \mathrm{CE}(\mathbf{w}) + (1 - \lambda) \, \mathrm{nLL}(\mathbf{w}), \qquad (1)$$
where $\mathrm{CE}(\mathbf{w})$ is the cross-entropy
$$\mathrm{CE}(\mathbf{w}) = -\frac{1}{N} \sum_{n=1}^{N} \log \frac{S_{y_n}(\mathbf{x}_n)}{\sum_{c=1}^{C} S_c(\mathbf{x}_n)}, \qquad (2)$$
and $\mathrm{nLL}(\mathbf{w})$ denotes the normalized negative log-likelihood
$$\mathrm{nLL}(\mathbf{w}) = -\frac{1}{N \, |\mathbf{X}|} \sum_{n=1}^{N} \log \sum_{c=1}^{C} \tfrac{1}{C} \, S_c(\mathbf{x}_n). \qquad (3)$$
By setting $\lambda = 1$, we purely train on cross-entropy (discriminative setting), while for $\lambda = 0$ we perform pure maximum-likelihood training (generative setting). For $0 < \lambda < 1$, we obtain a continuum of hybrid objectives, trading off the generative and discriminative character of the model.
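As an illustration, the following numpy sketch computes the hybrid objective from the log-outputs of the root sum nodes; it relies on our reconstruction of Eqs. (1)-(3) above, so the exact normalization of the nLL term is an assumption, and the function name is ours.

```python
# Hedged numpy sketch of the hybrid objective; log_S holds the log-outputs
# log S_c(x_n) of the C root sum nodes, shape (N, C). The per-feature
# normalization of the nLL term follows our reconstruction of Eq. (3) above.
import numpy as np
from scipy.special import logsumexp

def hybrid_objective(log_S, labels, num_features, lam):
    N, C = log_S.shape
    # cross-entropy, Eq. (2): -mean_n log [ S_{y_n}(x_n) / sum_c S_c(x_n) ]
    log_posterior = log_S - logsumexp(log_S, axis=1, keepdims=True)
    ce = -np.mean(log_posterior[np.arange(N), labels])
    # normalized negative log-likelihood over inputs, uniform class prior 1/C
    log_px = logsumexp(log_S - np.log(C), axis=1)
    nll = -np.mean(log_px) / num_features
    return lam * ce + (1.0 - lam) * nll
```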
We implemented RAT-SPNs in Python/TensorFlow, where each region in our region graph is associated with a matrix that has as many rows as the batch size (which we kept fixed throughout). Each column represents one distribution in the region, i.e., there are $I$, $S$, and $C$ columns in input regions, internal regions, and the root region, respectively. We perform all computations in the log-domain: as is well known, multiplying many small probability values in the linear domain quickly approaches zero, making the computations prone to underflow. Therefore, we practically replace product nodes with additions and sum nodes with logsumexp operations, employing the frequently used "trick" of computing $\log \sum_k \exp(a_k)$ via $\max_{k'} a_{k'} + \log \sum_k \exp(a_k - \max_{k'} a_{k'})$. This function is readily provided in TensorFlow.
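For completeness, a small numpy sketch of these log-domain node computations (the function names are ours):

```python
# Small numpy sketch of the log-domain node computations described above.
import numpy as np

def log_product_node(child_log_probs):
    return np.sum(child_log_probs)              # products become additions

def log_sum_node(child_log_probs, log_weights):
    a = child_log_probs + log_weights           # log(w_k * p_k)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))    # max-shifted logsumexp, underflow-safe

print(log_sum_node(np.array([-1050.0, -1052.0]), np.log([0.5, 0.5])))
```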
Implementing RAT-SPNs in TensorFlow allows us to optimize our objective using automatic differentiation and off-the-shelf gradient-based optimizers. Throughout our experiments, we used Adam [13] with its default settings. As input distributions, we used Gaussian distributions with isotropic covariances, i.e., each input distribution reduces to a product combining single-dimensional Gaussians with shared variances. We tried to optimize the variances jointly with the means, which, however, delivered worse results than simply setting all variances to a fixed uniform value. We conjecture that Adam might not be well-suited to optimize variances, whereas optimization schemes like EM have no problem in this case [19]. While RAT-SPNs are implemented and trained in a seamless way, they unfortunately yield hundreds of tensors, which is a suboptimal layout in TensorFlow. This, together with performing computations in the log-domain, makes RAT-SPNs approximately an order of magnitude slower to train than ReLU-MLPs of similar size. Note, however, that this disadvantage is mostly due to the current state of hardware and software development for deep learning; it is not a fundamental limitation.
3.3 Probabilistic Dropout
The size of RAT-SPNs can be easily controlled via the structural parameters $D$, $R$, $S$, and $I$. RAT-SPNs with many parameters, however, tend to overfit—just like regular neural networks—which requires regularization. One of the classical techniques that boosted deep learning models is Srivastava et al.'s dropout heuristic [26]. It sets inputs and/or hidden units to zero with a certain probability $p$ and multiplies the remaining layer outputs by $\frac{1}{1-p}$. In the following, we modify the dropout heuristic, proposing two variants for RAT-SPNs that exploit their probabilistic nature.
3.3.1 Dropout at Inputs: Marginalizing out Inputs
Dropout at the inputs essentially marks input features as missing at random. In the probabilistic paradigm, we would simply marginalize over these missing features. Fortunately, this is an easy exercise in SPNs, as we only need to set the distributions corresponding to the dropped-out features to 1. Since we operate in the log-domain, this means setting the corresponding log-distribution nodes to 0. This is in fact quite similar to standard dropout, except that we are not compensating by $\frac{1}{1-p}$, and blocks of units are dropped out (i.e., all log-distributions whose scope corresponds to a missing input feature are jointly set to 0).
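A minimal numpy sketch of this probabilistic input dropout, with illustrative names and bookkeeping of our own:

```python
# Hedged numpy sketch of probabilistic input dropout: the log-densities of all
# leaves whose scope is a dropped feature are set to 0 (= log 1), i.e., those
# features are marginalized out. Names and bookkeeping are ours.
import numpy as np

def dropout_inputs(leaf_log_probs, leaf_scopes, keep_prob, rng=np.random):
    """leaf_log_probs: (num_leaves,) log-densities; leaf_scopes: feature index per leaf."""
    num_features = max(leaf_scopes) + 1
    kept = rng.rand(num_features) < keep_prob    # features kept at random
    out = leaf_log_probs.copy()
    for i, scope in enumerate(leaf_scopes):
        if not kept[scope]:
            out[i] = 0.0                         # log 1: this leaf is marginalized
    return out
```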
3.3.2 Dropout at Sums: Injection of Discrete Noise
As discussed in [22, 35, 19], sum nodes in SPNs can be interpreted as marginalized latent variables, akin to the latent variable interpretation in mixture models. In particular, [19] introduced so-called augmented SPNs, which explicitly incorporate these latent variables in the SPN structure. The augmentation first introduces indicator nodes representing the states of the latent variables, which can switch the children of sum nodes on or off by connecting them via an additional product. This mechanism establishes the explicit interpretation of sum children as conditional distributions. In case the completeness of the resulting SPN is impaired, additional sum nodes (twin sums) are introduced to complete the probabilistic model. See the discussion in Peharz et al. [19] for more details.
In RAT-SPNs, we can equally well interpret a whole region as a single latent variable, and the weights of each sum node in this region as the conditional distribution of this variable. Indeed, as is easily checked, the argumentation in [19] also holds when introducing a single set of indicators for a latent variable that is shared by all sum nodes in one region, as they all have the same scope and the same children. While the latent variables are not observed, we can employ a simple probabilistic version of dropout by introducing artificial observations for them. For example, if the sum nodes in a particular region have $K$ children (i.e. the corresponding latent variable has $K$ states), then we can introduce the artificial information that the variable assumes a state in some random subset of its $K$ states. By doing this for each latent variable in the network, we essentially select a small substructure of the whole SPN to explain the data—this argument is very similar to the original dropout proposal [26].
In any case, implementing dropout at sum layers is again straightforward: we select a random subset of the product nodes which are connected to the sums in one region and set them to 0 (i.e., $-\infty$ in the log-domain). Again, we do not multiply with a correction factor.
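A corresponding sketch for dropout at a sum layer (again, the function name and the keep-probability convention are ours):

```python
# Hedged numpy sketch of dropout at a sum layer: a random subset of the product
# nodes feeding the sums of one region is switched off by setting their
# log-outputs to -inf (probability 0); no rescaling is applied.
import numpy as np

def dropout_products(product_log_probs, keep_prob, rng=np.random):
    keep = rng.rand(len(product_log_probs)) < keep_prob
    if not keep.any():                           # keep at least one child alive
        keep[rng.randint(len(product_log_probs))] = True
    out = product_log_probs.copy()
    out[~keep] = -np.inf                         # dropped product nodes contribute 0
    return out
```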
4 Experiments
In the following, we investigate the fitness of RAT-SPNs as deep learning models. Furthermore, we highlight their advantages when used as a generative model.
4.1 Exploring the Capacity of RAT-SPNs
We start off by exploring the capacity of RAT-SPNs as function approximators for classification. A simple way to assess model capacity is trying to overfit the training data with various model sizes. To this end, we fit RAT-SPNs on the MNIST (yann.lecun.com/exdb/mnist) training data, using every combination of split depth $D$, number of split repetitions $R$, and number of distributions per region from a grid of values. As we are in the data-agnostic setting, the natural baselines are MLPs, for which we take ReLU activations for the hidden units and linear activations for the output layer. We ran MLPs with every combination of number of layers and number of hidden units from a corresponding grid. For both RAT-SPNs and MLPs, we used Adam with its default parameters to optimize cross-entropy (i.e., $\lambda = 1$ for RAT-SPNs).
Figure 2 summarizes the training accuracy of both models after 200 epochs as a function of the number of parameters in the respective model. As one can see, RAT-SPNs scale to millions of parameters and are easily able to overfit the MNIST training set, to the same extent as MLPs. While RAT-SPNs sometimes appear slightly better suited to fit the data, this is in fact only an artifact of SGD optimization: MLPs still jitter around during the last epochs, while the accuracy of RAT-SPNs remains stable.
These overfitting results indicate that RAT-SPNs are, capacity-wise, at least as powerful as ReLU-MLPs. In the next experiment, we investigated whether RAT-SPNs are also on par with MLPs concerning generalization on classification tasks. Subsequently, we investigated whether RAT-SPNs, due to their probabilistic nature, exhibit superior performance when dealing with missing features and in reliably identifying outliers.
4.2 Generalization of RAT-SPNs
When trained without regularization, RAT-SPNs achieve a test accuracy on MNIST that is rather inferior even for data-agnostic models. Therefore, we trained them with our probabilistic dropout variants as introduced in Section 3.3. We cross-validated $D$, $R$, and the number of distributions per region, and applied a range of dropout rates for the inputs and for the sum layers. Note that in our convention a dropout rate of $p$ means that a fraction $p$ of the features is actually kept.
For comparison, we trained ReLU-MLPs, cross-validating the number of hidden layers, the number of hidden units, the input dropout rate, and the dropout rate for hidden layers. No dropout was applied to the output layer. We trained MLPs in two variants, namely 'vanilla' (vMLPs), meaning that besides dropout no additional optimization tricks were applied, and a variant (MLP) also employing Xavier initialization [10] and batch normalization [12]. While MLP should be considered the default way to train MLPs, one should note that helpful heuristics like Xavier initialization and batch normalization have evolved over many years, while similar techniques for RAT-SPNs are not yet available. Thus, vMLPs might serve as the fairer comparison.
Table 1: Test accuracy (top) and test cross-entropy (bottom), with the number of model parameters given in parentheses.

                 RAT-SPN          MLP              vMLP
Accuracy
  MNIST          98.19 (8.5M)     98.32 (2.64M)    98.09 (5.28M)
  F-MNIST        89.52 (0.65M)    90.81 (9.28M)    89.81 (1.07M)
  20NG           47.8  (0.37M)    49.05 (0.31M)    48.81 (0.16M)
Cross-Entropy
  MNIST          0.0852 (17M)     0.0874 (0.82M)   0.0974 (0.22M)
  F-MNIST        0.3525 (0.65M)   0.2965 (0.82M)   0.325  (0.29M)
  20NG           1.6954 (1.63M)   1.6180 (0.22M)   1.6263 (0.22M)
We trained on MNIST, fashion-MNIST (a data set in the same format as MNIST, but with the task of classifying fashion items rather than digits; github.com/zalandoresearch/fashionmnist), and 20 News Groups (20NG; scikit-learn.org/stable/datasets/twenty_newsgroups.html). The 20NG data set is a text corpus of 18846 news documents belonging to 20 different news groups or classes. We first split the news documents into 13568 instances for training, 1508 for validation, and 3770 for testing. The text was preprocessed into a bag-of-words representation by keeping the 1000 most relevant words according to their TF-IDF. Then, 50 topics were extracted by LDA [2] and employed as the new feature representation for classification.
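As an illustration, a plausible scikit-learn reconstruction of this preprocessing pipeline is sketched below; beyond the numbers stated in the text (1000 words, 50 topics), all settings are assumptions, and whether LDA was fit on raw counts or TF-IDF weights is not specified.

```python
# Hedged scikit-learn sketch of the 20NG preprocessing described above;
# exact settings beyond the stated numbers are assumptions of ours.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

newsgroups = fetch_20newsgroups(subset="all")       # 18846 documents, 20 classes
tfidf = TfidfVectorizer(max_features=1000)          # keep the 1000 most relevant words
bow = tfidf.fit_transform(newsgroups.data)
lda = LatentDirichletAllocation(n_components=50)    # extract 50 topics as new features
features = lda.fit_transform(bow)                   # shape: (num_docs, 50)
labels = newsgroups.target
```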
Table 1 summarizes the classification accuracy and cross-entropy on the test set, as well as the size of the models in terms of the number of parameters. As one can see, RAT-SPNs are on par with MLPs, being only slightly outperformed on these traditional classification tasks. Note, however, that our approach of setting SPNs in a classical connectionist setting is rather simple; this, together with the capacity analysis in Section 4.1, indicates the potential of SPNs as prediction models. Moreover, as the following sections show, the real potential of probabilistic deep learning models actually lies beyond classical benchmark results.
4.3 Hybrid Post-Training
Recall that SPNs define a full distribution over both the inputs $\mathbf{X}$ and the class variable $Y$, and that our objective (1) with parameter $\lambda$ allows us to trade off between cross-entropy ($\lambda = 1$) and log-likelihood ($\lambda = 0$). When $\lambda = 1$, we cannot hope that the distribution over the inputs is faithful to the underlying data. By setting $\lambda < 1$, however, we can obtain interesting hybrid models, yielding both discriminative and generative behavior. To this end, we take the RAT-SPN with the highest validation accuracy and post-train it for another 20 epochs, for various values of $\lambda$. This yields a natural trade-off between the log-likelihood over the inputs and the predictive performance in terms of classification accuracy/cross-entropy. Figure 3 shows this trade-off. As one can see, by sacrificing little predictive performance, we can drastically improve the generative character of SPNs. The benefit of this is shown in the following.
4.4 SPNs Are Robust Against Missing Features
When input features are missing at random, the probabilistic paradigm offers a clear solution: the marginalization of the missing features. As SPNs allow simple and efficient marginalization, we expect RAT-SPNs to treat missing features robustly, especially the "more generative" they are (corresponding to smaller $\lambda$). To this end, we randomly discard a fraction of the pixels in the MNIST test data—independently for each sample—and classify the data using RAT-SPNs trained with various values of $\lambda$, marginalizing the missing features. This is the same procedure we used for probabilistic dropout during training, cf. Section 3.3. Similarly, we might expect MLPs to perform robustly under missing features at test time by applying (classical) dropout.
Figure 4 summarizes the classification results as the fraction of missing features is varied. As one can see, RAT-SPNs with smaller $\lambda$ are more stable against even large fractions of missing features. A particularly interesting regime is an intermediate choice of $\lambda$: here the corresponding RAT-SPN starts with a high accuracy for no missing features and degrades very gracefully; for large fractions of missing features, the advantage over MLPs is dramatic. Note that this result is consistent with other hybrid learning schemes applied in graphical models [18]. Purely discriminative RAT-SPNs and MLPs are roughly on par concerning robustness against missing features.
4.5 SPNs Know What They Don't Know
Besides being robust against missing features, an important property of (hybrid) generative models is that they can, in principle, detect outliers and peculiarities by monitoring the likelihood over the inputs. To this end, we evaluated the likelihoods on the test sets of both MNIST and fashion-MNIST under the respective hybrid post-trained RAT-SPN. We selected two thresholds by visual inspection of the histograms over the input likelihoods, roughly delimiting the most unlikely and most likely samples, respectively. In both of these sets, we selected—following the original order in MNIST—the first 10 samples which are correctly and incorrectly classified, respectively. We thus obtained 4 groups of 10 samples each: outlier/correct, outlier/incorrect, inlier/correct, inlier/incorrect.
These samples are shown in Figure 5. Albeit qualitative, these results are interesting: one can visually confirm that the outlier MNIST digits are indeed peculiar, both the correctly and the incorrectly classified ones. Among the outlier/incorrect group are 2 samples (top row, right, 3rd and 8th) which are not recognizable to the authors either. The inlier/incorrect digits can be interpreted—with some care and a grain of salt—as the ambiguous ones, e.g. two '2's (bottom row, right, 5th and 6th) are similar to a '7' (and indeed classified as such), and one digit (bottom row, right, 8th) is genuinely ambiguous between two classes. For fashion-MNIST, one can clearly see that the outliers are all low in contrast and fill the whole image. In one image (top row, right, 9th), the background has not been removed.
More objectively, we use Bradshaw et al.'s transfer testing (TT), a technique to assess the calibration of uncertainties in probabilistic models [3]. TT is quite simple: we feed a classifier trained on one domain (e.g. MNIST) with examples from a related but different domain (e.g. street view house numbers (SVHN) [17] or the handwritten digits of SEMEION [4]). While we would expect most classifiers to perform poorly in such a setting, an important property of an AI system would be to be aware that it is confronted with out-of-domain data and to be able to communicate this either to other parts of the system or to a human user. While Bradshaw et al. applied TT to conditional models, i.e., to output uncertainties, a more natural approach is to apply it to input likelihoods, if available, such as in SPNs.
Figure 6, top, shows histograms of the log-likelihoods of the hybrid post-trained RAT-SPN when fed with MNIST test data (in-domain), SVHN test data (out-of-domain), and SEMEION (out-of-domain). The result is striking: the histograms show that the likelihood over the inputs provides a strong signal (note the log-scale of the y-axis) of whether a sample comes from in-domain or out-of-domain. That is, RAT-SPNs have an additional communication channel—the likelihood over the inputs—to tell us whether they are confident in their predictions.
An MLP, as a non-probabilistic model, does not have such a means. As a sanity check, however, we mimic the same computation performed in RAT-SPNs to obtain a log-likelihood: adding $\log \frac{1}{C}$ to each output (assuming a uniform class prior) and computing the logsumexp of the result. One might suspect that the result, although not interpretable as a log-probability in MLPs, still yields a decent measure of confidence. In need of a name for this rather odd quantity, we call it the mock-likelihood. Figure 6, bottom, shows histograms of this mock-likelihood: although the histograms are more spread out for out-of-domain data, they are highly overlapping, yielding no clear signal for out-of-domain vs. in-domain.
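For clarity, a short numpy sketch of this mock-likelihood computation (the function name is ours):

```python
# Short numpy sketch of the mock-likelihood: add log(1/C) to each of the C
# linear outputs and apply logsumexp. For a RAT-SPN, the analogous computation
# on the root sum nodes yields the actual log-likelihood log p(x).
import numpy as np
from scipy.special import logsumexp

def mock_log_likelihood(outputs):     # outputs: (N, C) linear-layer activations
    C = outputs.shape[1]
    return logsumexp(outputs + np.log(1.0 / C), axis=1)
```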
Taking all experimental results together, (RAT-)SPNs are powerful deep learning models: they are robust function approximators as well as probability estimators for arbitrary inputs and outputs, with fast and exact inference.
5 Conclusion
We introduced a particularly simple but effective way to train SPNs: simply pick a random structure and train it in an end-to-end fashion like a neural network. This makes the application of SPNs within the deep learning framework seamless and allows the use of common deep learning tools such as automatic differentiation and easy use of GPUs. As a modest technical contribution, we adapted the well-known dropout heuristic and equipped it with a sound probabilistic interpretation within RAT-SPNs. RAT-SPNs showed a performance on par with traditional neural networks on several classification tasks. Moreover, RAT-SPNs demonstrate their full power when used as a generative model, showing remarkable robustness against missing features through exact and efficient inference, and compelling results in anomaly/out-of-domain detection. In future work, the hybrid properties of RAT-SPNs could enable promising directions such as new variants of semi-supervised or active learning. While this paper was held in the data-agnostic regime, in future work we will investigate SPNs tailored to structured data sources.
References
 [1] T. Adel, D. Balduzzi, and A. Ghodsi. Learning the structure of sum-product networks via an SVD-based algorithm. In UAI, 2015.
 [2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 2003.
 [3] J. Bradshaw, A. Matthews, and Z. Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv preprint, 2017. arxiv.org/abs/1707.02476.
 [4] M. Buscema. MetaNet*: The theory of independent judges, volume 33(2), 1998.
 [5] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.
 [6] A. Dennis and D. Ventura. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of NIPS, 2012.
 [7] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [8] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Proceedings of NIPS, pages 3248–3256, 2012.
 [9] R. Gens and P. Domingos. Learning the structure of sum-product networks. In Proceedings of ICML, pages 873–880, 2013.
 [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS, pages 249–256, 2010.
 [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of NIPS, pages 2672–2680, 2014.
 [12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML, 2015.
 [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2015.
 [14] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014. arXiv:1312.6114.
 [15] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In Proceedings of AISTATS, pages 29–37, 2011.
 [16] A. Molina, A. Vergari, N. Di Mauro, S. Natarajan, F. Esposito, and K. Kersting. Mixed sum-product networks: A deep architecture for hybrid domains. In Proceedings of AAAI, 2018.
 [17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 [18] R. Peharz, B. Geiger, and F. Pernkopf. Greedy part-wise learning of sum-product networks. In Proceedings of ECML/PKDD, pages 612–627. Springer Berlin, 2013.
 [19] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
 [20] R. Peharz, G. Kapeller, P. Mowlaee, and F. Pernkopf. Modeling speech with sum-product networks: Application to bandwidth extension. In Proceedings of ICASSP, pages 3699–3703, 2014.
 [21] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In Proceedings of AISTATS, pages 744–752, 2015.
 [22] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of UAI, pages 337–346, 2011.
 [23] A. Rashwan, H. Zhao, and P. Poupart. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In AISTATS, pages 1469–1477, 2016.
 [24] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of ICML, pages 1278–1286, 2014.
 [25] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. ICML – JMLR W&CP, 32:710–718, 2014.
 [26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
 [27] M. Trapp, T. Madl, R. Peharz, F. Pernkopf, and R. Trappl. Safe semi-supervised learning of sum-product networks. In Proceedings of UAI, 2017.
 [28] M. Trapp, R. Peharz, M. Skowron, T. Madl, F. Pernkopf, and R. Trappl. Structure inference in sum-product networks using infinite sum-product trees. In NIPS Workshop on Practical Bayesian Nonparametrics, 2016.
 [29] B. Uria, I. Murray, and H. Larochelle. A deep and tractable density estimator. In Proceedings of ICML, pages 467–475, 2014.
 [30] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of ICML, 2016.
 [31] A. Vergari, N. Di Mauro, and F. Esposito. Simplifying, regularizing and strengthening sum-product network structure learning. In Proceedings of ECML/PKDD, pages 343–358. Springer, 2015.
 [32] A. Vergari, N. Di Mauro, and F. Esposito. Visualizing and understanding sum-product networks. arXiv preprint, 2016.
 [33] A. Vergari, R. Peharz, N. Di Mauro, A. Molina, K. Kersting, and F. Esposito. Sum-product autoencoding: Encoding and decoding representations using sum-product networks. In AAAI, 2018.
 [34] H. Zhao, T. Adel, G. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In Proceedings of ICML, 2016.
 [35] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In Proceedings of ICML, 2015.
 [36] H. Zhao, P. Poupart, and G. J. Gordon. A unified approach for learning the parameters of sum-product networks. In Proceedings of NIPS, 2016.