Probabilistic Deep Learning using Random Sum-Product Networks


Robert Peharz, Dept. of Engineering, University of Cambridge
Martin Trapp, Austrian Research Institute for Artificial Intelligence
Antonio Vergari, Max Planck Institute for Intelligent Systems
Kristian Kersting, Computer Science Dept., TU Darmstadt
Karl Stelzner, Computer Science Dept., TU Darmstadt
Zoubin Ghahramani, Dept. of Engineering, University of Cambridge
Alejandro Molina, Computer Science Dept., TU Darmstadt

Abstract

Probabilistic deep learning currently receives increased interest, as consistent treatment of uncertainty is one of the most important goals in machine learning and AI. Most current approaches, however, have severe limitations concerning inference. Sum-product networks (SPNs), although having excellent properties in that regard, have so far not been explored as serious deep learning models, likely due to their special structural requirements. In this paper, we make a drastic simplification and use a random structure which is trained in a “classical deep learning manner”, i.e., using automatic differentiation, SGD, and GPU support. The resulting models, called RAT-SPNs, yield prediction results comparable to deep neural networks, while maintaining well-calibrated uncertainty estimates, which makes them highly robust against missing data. Furthermore, they successfully capture uncertainty over their inputs in a convincing manner, yielding robust outlier and peculiarity detection.

 


1 Introduction

Dealing with uncertainty clearly is one of the most important aspects of machine learning and AI. An intelligent system should be able to deal with uncertain inputs (e.g. missing features) as well as express its uncertainty over outputs. Especially the latter is a crucial point in decision-making processes such as in medical diagnosis and planning systems for autonomous agents. It is, therefore, no surprise that probabilistic approaches have recently gained tremendous momentum also in deep learning, the currently predominant branch in machine learning. Examples of probabilistic deep learning systems are variational autoencoders (VAEs) [14], deep generative models [24], generative adversarial nets (GANs) [11], neural auto-regressive density estimators (NADEs) [15], and Pixel-CNNs/RNNs [30].

However, most of these probabilistic deep learning approaches have limited capabilities when it comes to inference. Implicit likelihood models like GANs, even when successful in capturing the data distribution, do not allow one to evaluate the probability of a test sample. Similar problems arise in deep generative models and VAEs, which typically use an inference network to infer the posterior over a latent variable space. However, inference in both these models and—ironically—also in their inference networks is limited to drawing samples, which forces users to resort to Monte Carlo estimates. NADEs and Pixel-CNNs/RNNs, both instances of auto-regressive density estimators, allow one to efficiently evaluate sample likelihoods and even certain marginalization and conditioning tasks, provided the marginalized/conditioned variables appear first in the assumed variable ordering; otherwise, inference is rendered intractable. Uria et al. [29] address this problem by training an ensemble of NADEs with shared network structure. This approach, however, introduces the delicate problem of approximately training a super-exponential ensemble of NADEs.

Sum-Product Networks (SPNs) are a class of probabilistic models with a crucial advantage over the models above [22]: they permit exact and efficient inference. More precisely, they are able to compute any marginalization and conditioning query in time linear in the model’s representation size. However, although SPNs can be described in a nutshell as “deep mixture models” [19], they have received rather limited attention in the deep learning literature, despite their attractive inference properties. We identify three reasons for this situation. First, the structure of SPNs needs to obey certain constraints, requiring either careful structure design by hand or learning the structure from data [6, 9, 18, 25, 31, 1, 28]. Second, the parameter learning schemes proposed so far are either inspired by graphical models [22, 34, 19] or are tailored to SPNs [8]. These peculiarities concerning structure and parameter learning have probably hindered a wide application of SPNs in the connectionist approach so far. Third, there seems to be a folklore belief that SPNs are “somewhat weak function approximators”, i.e., that SPNs are less suitable for solving prediction tasks to the extent we expect from deep neural networks. However, this belief is not theoretically grounded: SPNs inherit universal approximation properties from mixture models—a mixture model is simply a “shallow” SPN with a single sum node. Consequently, SPNs should in theory also be able to represent any prediction function via probabilistic inference.

In this paper, we empirically demystify this folklore and investigate the fitness of SPNs as deep learning models. To this aim, we introduce a novel and particularly simple way to construct SPNs, waiving the necessity for structure learning. Our SPNs are obtained by first constructing a random region graph [6, 18], laying out the overall network design. Subsequently, the region graph is populated with tensors of SPN nodes, which allows an easy mapping onto deep learning frameworks such as TensorFlow [7]. Consequently, our models—called Random Tensorized SPNs (RAT-SPNs)—can be optimized in an end-to-end fashion, using standard deep learning techniques such as automatic differentiation, adaptive SGD optimizers, and automatic GPU parallelization. To avoid overfitting, we adopt the well-known dropout heuristic [26], which obtains an elegant probabilistic interpretation as marginalization of missing features (dropout at inputs) and as injection of discrete noise (dropout at sum nodes). We trained RAT-SPNs on several real-world classification data sets, showing that their predictive performance is comparable to traditional deep neural networks. At the same time, RAT-SPNs specify a complete distribution over both inputs and outputs, which allows us to treat uncertainty in a consistent and efficient manner. First, we show that RAT-SPNs are dramatically more robust against missing features than neural networks. Second, we show that RAT-SPNs also provide well-calibrated uncertainty estimates over their inputs, i.e., the model “knows what it does not know”, which can be exploited for anomaly and out-of-domain detection.

We proceed as follows. After reviewing background and related work, Section 3 introduces RAT-SPNs and shows how to implement and train them. Before concluding, we present our empirical evidence in Section 4.

2 Related Work

We denote random variables (RVs) by upper-case letters, e.g. $X$, $Y$, and their values by corresponding lower-case letters, e.g. $x$, $y$. Similarly, we denote sets of RVs by boldface letters, e.g. $\mathbf{X}$, $\mathbf{Y}$, and their combined values by $\mathbf{x}$, $\mathbf{y}$.

An SPN over $\mathbf{X}$ is a probabilistic model defined via a directed acyclic graph (DAG) containing three types of nodes: input distributions, sums and products. All leaves of the SPN are distribution functions over some subset $\mathbf{X}' \subseteq \mathbf{X}$. When we know that a node is a leaf, we also use the explicit symbol $\mathsf{L}$. Inner nodes are either weighted sums or products, denoted as $\mathsf{S}$ and $\mathsf{P}$, respectively, i.e., $\mathsf{S} = \sum_{\mathsf{N} \in \mathbf{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{N}}\,\mathsf{N}$ and $\mathsf{P} = \prod_{\mathsf{N} \in \mathbf{ch}(\mathsf{P})} \mathsf{N}$, where $\mathbf{ch}(\mathsf{N})$ denotes the children of node $\mathsf{N}$. The sum weights $w_{\mathsf{S},\mathsf{N}}$ are assumed to be non-negative and normalized, i.e., $w_{\mathsf{S},\mathsf{N}} \geq 0$ and $\sum_{\mathsf{N} \in \mathbf{ch}(\mathsf{S})} w_{\mathsf{S},\mathsf{N}} = 1$.

The scope of an input distribution $\mathsf{L}$ is defined as the set of RVs for which $\mathsf{L}$ is defined: $\mathbf{sc}(\mathsf{L})$. The scope of an inner node $\mathsf{N}$ is recursively defined as $\mathbf{sc}(\mathsf{N}) = \bigcup_{\mathsf{N}' \in \mathbf{ch}(\mathsf{N})} \mathbf{sc}(\mathsf{N}')$. To allow efficient inference, SPNs are required to fulfill two structure constraints [5, 22], namely completeness and decomposability. An SPN is complete if for each sum $\mathsf{S}$ it holds that $\mathbf{sc}(\mathsf{N}') = \mathbf{sc}(\mathsf{N}'')$, for each $\mathsf{N}', \mathsf{N}'' \in \mathbf{ch}(\mathsf{S})$. An SPN is decomposable if it holds for each product $\mathsf{P}$ that $\mathbf{sc}(\mathsf{N}') \cap \mathbf{sc}(\mathsf{N}'') = \emptyset$, for each $\mathsf{N}' \neq \mathsf{N}'' \in \mathbf{ch}(\mathsf{P})$. In that way, all nodes in an SPN recursively define a distribution over their respective scopes: the leaves are distributions by definition, sum nodes are mixtures of their child distributions, and products are factorized distributions, i.e., assuming independence among the scopes of their children.

Besides representing probability distributions, the crucial advantage of SPNs is that they allow efficient inference: in particular, any marginalization task reduces to the corresponding marginalizations at the leaves (each leaf marginalizing only over its own scope), followed by recursively evaluating the internal nodes in a bottom-up pass [21]. Thus, marginalization in SPNs follows essentially the same procedure as evaluating the likelihood of a sample—both scale linearly in the SPN’s representation size (assuming tractable marginalization at the leaves). Conditioning can be tackled in a similar manner.
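To make these operations concrete, the following is a minimal Python sketch for exposition (not the tensorized implementation described later) of a complete and decomposable SPN with Gaussian leaves, evaluated in the log-domain; marginalizing a set of RVs amounts to setting the log-densities of the affected leaves to 0.

    import numpy as np
    from scipy.stats import norm
    from scipy.special import logsumexp

    class Leaf:
        """Univariate Gaussian input distribution over a single RV (given by its index)."""
        def __init__(self, rv, mean=0.0, std=1.0):
            self.rv, self.mean, self.std = rv, mean, std
        def log_p(self, x, marginalized=()):
            if self.rv in marginalized:
                return 0.0   # a marginalized leaf integrates to 1, i.e. 0 in the log-domain
            return norm.logpdf(x[self.rv], self.mean, self.std)

    class Product:
        """Decomposable product node: children have disjoint scopes, log-densities add up."""
        def __init__(self, children):
            self.children = children
        def log_p(self, x, marginalized=()):
            return sum(c.log_p(x, marginalized) for c in self.children)

    class Sum:
        """Complete sum node: children share the same scope, i.e. a mixture with normalized weights."""
        def __init__(self, children, weights):
            self.children, self.log_w = children, np.log(np.asarray(weights))
        def log_p(self, x, marginalized=()):
            lls = np.array([c.log_p(x, marginalized) for c in self.children])
            return logsumexp(lls + self.log_w)

    # A tiny SPN over X0, X1: a mixture of two factorized Gaussians.
    spn = Sum([Product([Leaf(0, -1.0), Leaf(1, -1.0)]),
               Product([Leaf(0, 2.0), Leaf(1, 2.0)])], weights=[0.3, 0.7])
    x = np.array([1.5, 2.5])
    log_joint = spn.log_p(x)                       # log p(x0, x1)
    log_marginal = spn.log_p(x, marginalized={1})  # log p(x0): X1 integrated out at the leaves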

Learning the parameters of SPNs, i.e., the sum weights and the parameters of the input distributions, can be addressed in various ways. By interpreting the sum nodes as discrete latent variables [22, 35, 19], SPNs can be trained using the classical expectation-maximization (EM) algorithm. “Hard” versions of EM and gradient descent have been proposed in [22, 8]. Gens and Domingos [8], e.g., trained SPNs using a discriminative objective, achieving then-state-of-the-art classification results on image benchmarks. However, the SPN structure employed there was rather shallow and relied on rich, hand-crafted feature extraction. Bayesian learning schemes have been proposed in [23, 34]. Zhao et al. [36] derived a concave-convex procedure, which interestingly coincides with the EM updates for sum weights. Subsequently, Trapp et al. [27] introduced a safe semi-supervised learning scheme for discriminative and generative parameter learning, providing performance guarantees for the semi-supervised case. Vergari et al. [33] extended SPNs to representation learning, exploiting SPN inference as encoding and decoding routines.

The structure of SPNs can be crafted by hand [22, 20] or learned from data. Most structure learners [25, 31, 1, 16] can be framed as variations of the prototypical top-down scheme LearnSPN due to Gens and Domingos [9]. It recursively splits the data via clustering (to determine sum nodes) and independence tests (for product nodes). The high cost of these repeated splits makes structure learning the bottleneck in training SPNs. In the present paper, we make a drastic simplification by picking a scalable random structure and optimizing its parameters with available deep learning tools.

3 Random Tensorized Sum-Product Networks

Figure 1: Example RAT-SPN over 7 RVs for three classes, with split depth D, R repetitions of the random splitting process, S sum nodes per internal region, and I input distributions per leaf region. White and gray boxes represent regions and partitions in the underlying random region graph, respectively. Nodes in the RAT-SPN can naturally be organized in layers, as indicated by dashed boxes.

To construct random-and-tensorized SPNs (RAT-SPNs) we use a region graph [22, 6, 18] as an abstract representation of the network structure. Given a set of RVs $\mathbf{X}$, a region $\mathbf{R}$ is defined as any non-empty subset of $\mathbf{X}$. Given any region $\mathbf{R}$, a $K$-partition $\mathcal{P}$ of $\mathbf{R}$ is a collection of $K$ non-empty, non-overlapping subsets of $\mathbf{R}$ whose union is again $\mathbf{R}$, i.e., $\mathcal{P} = \{\mathbf{R}_1, \dots, \mathbf{R}_K\}$ with $\mathbf{R}_k \neq \emptyset$, $\mathbf{R}_k \cap \mathbf{R}_l = \emptyset$ for $k \neq l$, and $\bigcup_k \mathbf{R}_k = \mathbf{R}$. Specifically, we here consider only 2-partitions, which will cause all product nodes in our SPNs to have exactly two children. This assumption, often made in the SPN literature, simplifies SPN design and does not impair performance [31].

Now, a region graph $\mathcal{R}$ over $\mathbf{X}$ is a DAG whose nodes are regions and partitions such that the following holds:

  • $\mathbf{X}$ is a region in $\mathcal{R}$ and has no parents (root region). All other regions have at least one parent.

  • All children of regions are partitions and all children of partitions are regions (i.e., $\mathcal{R}$ is bipartite).

  • If partition $\mathcal{P}$ is a child of region $\mathbf{R}$, then $\bigcup_{\mathbf{R}' \in \mathcal{P}} \mathbf{R}' = \mathbf{R}$.

  • If region $\mathbf{R}$ is a child of partition $\mathcal{P}$, then $\mathbf{R} \in \mathcal{P}$.

From this definition it is evident that a region graph dictates a hierarchical partition of the overall scope $\mathbf{X}$. We denote regions which have no child partitions as leaf regions.

Given a region graph, we can construct a corresponding SPN, as illustrated in Alg. 1. Here, each of the C classes is represented by a sum node in the root region, I is the number of input distributions per leaf region, and S is the number of sum nodes in regions which are neither leaf nor root regions. It is easy to verify that this scheme leads to a complete and decomposable SPN.

1:procedure ConstructSPN(region graph, C, S, I)
2:     Make empty SPN
3:     for each region R in the region graph do
4:         if R is a leaf region then
5:             Equip R with I distribution nodes
6:         else if R is the root region then
7:             Equip R with C sum nodes
8:         else
9:             Equip R with S sum nodes
10:     for each partition {R1, R2} in the region graph do
11:         Let N(R1), N(R2) be the sets of nodes of regions R1, R2
12:         for each N1 in N(R1), N2 in N(R2) do
13:             Introduce product P = N1 × N2
14:             Let P be a child of each sum node in the region that is the parent of {R1, R2}
15:     return SPN
Algorithm 1 Construct SPN from Region Graph
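For illustration, the following Python sketch implements Alg. 1 under a simplified, hypothetical representation (regions as frozensets of RV indices, the region graph as a dict mapping each region to its child 2-partitions, nodes as plain dicts); it is a sketch for exposition, not the tensorized TensorFlow implementation described in Section 3.2.

    from itertools import product as cross_product

    def construct_spn(region_graph, root, C, S, I):
        """Populate a region graph with nodes (cf. Alg. 1).

        region_graph: dict mapping each region (a frozenset of RV indices) to a
                      list of its child 2-partitions, each a pair (R1, R2).
        Returns a dict mapping each region to its list of nodes; a node is a dict
        holding its type and children (sum weights are omitted in this sketch).
        """
        nodes = {}
        for region, partitions in region_graph.items():
            if not partitions:                       # leaf region
                nodes[region] = [{'type': 'leaf', 'scope': region} for _ in range(I)]
            elif region == root:                     # root region: one sum node per class
                nodes[region] = [{'type': 'sum', 'children': []} for _ in range(C)]
            else:                                    # internal region
                nodes[region] = [{'type': 'sum', 'children': []} for _ in range(S)]

        for region, partitions in region_graph.items():
            for r1, r2 in partitions:
                for n1, n2 in cross_product(nodes[r1], nodes[r2]):
                    prod = {'type': 'product', 'children': [n1, n2]}
                    for s in nodes[region]:          # the product becomes a child of every
                        s['children'].append(prod)   # sum node in the parent region
        return nodes

A random region graph in this representation is generated by the sketch following Alg. 2 below.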

Within this region graph SPN framework, we are able to deal both with multiclass classification—each sum node in the root region represents a class conditional, with the classes sharing the SPN structure below—and with density estimation—in which case C is simply 1.

3.1 Random Region Graphs

To construct random region graphs and in turn RAT-SPNs, we follow Alg. 2. We randomly divide the root region into two sub-regions of equal size (breaking ties in case of an odd number of RVs) and proceed recursively down to a split depth D, which determines the depth of the resulting SPN. This recursive splitting mechanism is repeated R times. Fig. 1 shows an SPN for classification built following Alg. 2.

1:procedure RandomRegionGraph(X, D, R)
2:     Create an empty region graph G
3:     Insert X in G
4:     for r = 1, ..., R do
5:         Split(G, X, D)
1:procedure Split(G, R, D)
2:     Draw balanced 2-partition {R1, R2} of R
3:     Insert R1, R2 in G
4:     Insert {R1, R2} in G
5:     if D > 1 then
6:         if |R1| > 1 then Split(G, R1, D − 1)
7:         if |R2| > 1 then Split(G, R2, D − 1)
Algorithm 2 Random Region Graph
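A minimal Python sketch of Alg. 2 in the same illustrative representation (regions as frozensets, the region graph as a dict from each region to its child 2-partitions) could look as follows; its output can be fed directly to the construction sketch given after Alg. 1.

    import random

    def random_region_graph(num_rvs, depth, repetitions, seed=0):
        """Construct a random region graph over RVs {0, ..., num_rvs - 1} (cf. Alg. 2)."""
        rng = random.Random(seed)
        root = frozenset(range(num_rvs))
        graph = {root: []}

        def split(region, remaining_depth):
            rvs = list(region)
            rng.shuffle(rvs)                         # draw a random balanced 2-partition
            half = len(rvs) // 2
            r1, r2 = frozenset(rvs[:half]), frozenset(rvs[half:])
            graph.setdefault(r1, [])
            graph.setdefault(r2, [])
            graph[region].append((r1, r2))
            if remaining_depth > 1:
                if len(r1) > 1:
                    split(r1, remaining_depth - 1)
                if len(r2) > 1:
                    split(r2, remaining_depth - 1)

        for _ in range(repetitions):                 # R independent splits of the root region
            split(root, depth)
        return graph, root

    # Example: a random region graph over 7 RVs, which can be passed to construct_spn above.
    graph, root = random_region_graph(num_rvs=7, depth=2, repetitions=2)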

Moreover, this construction scheme yields (RAT-)SPNs whose input distributions, sums, and products can be naturally organized in alternating layers. Similar to classical multilayer perceptrons (MLPs), each layer takes inputs only from its directly preceding layer. Unlike MLPs, however, layers in RAT-SPNs are connected block-wise sparsely in a random fashion. Thus, layers in MLPs and RAT-SPNs are hardly comparable; however, we suggest understanding each pair of sum and product layers as roughly corresponding to one layer in an MLP: sum layers play the role of (sparse) matrix multiplications, and product layers act as non-linearities (or, more precisely, bi-linearities of their inputs). Indeed, RAT-SPNs are similar in spirit to the re-parametrization of SPNs as MLPs considered by Vergari et al. [32]; however, our construction here combines nodes in blocks and reduces the overall sparseness.

3.2 Training and Implementation

Let $\{(\boldsymbol{x}_n, y_n)\}_{n=1}^{N}$ be a training set of inputs $\boldsymbol{x}_n$ and class labels $y_n$. Furthermore, let $p_{\boldsymbol{\theta}}(\boldsymbol{x}, y)$ denote the output of the RAT-SPN and $\boldsymbol{\theta}$ the set of all SPN parameters. We train RAT-SPNs by minimizing the objective

$\mathcal{O}(\boldsymbol{\theta}) = \lambda\,\mathrm{CE}(\boldsymbol{\theta}) + (1-\lambda)\,\mathrm{nLL}(\boldsymbol{\theta})$   (1)

where $\mathrm{CE}(\boldsymbol{\theta})$ is the cross-entropy

$\mathrm{CE}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{n=1}^{N} \log p_{\boldsymbol{\theta}}(y_n \,|\, \boldsymbol{x}_n)$   (2)

and $\mathrm{nLL}(\boldsymbol{\theta})$ denotes the normalized negative log-likelihood

$\mathrm{nLL}(\boldsymbol{\theta}) = -\frac{1}{N\,|\mathbf{X}|}\sum_{n=1}^{N} \log p_{\boldsymbol{\theta}}(\boldsymbol{x}_n, y_n),$   (3)
where $|\mathbf{X}|$ denotes the number of input RVs.

By setting $\lambda = 1$, we purely train on cross-entropy (discriminative setting), while for $\lambda = 0$ we perform pure maximum likelihood training (generative setting). For $0 < \lambda < 1$, we have a continuum of hybrid objectives, trading off the generative and discriminative character of the model.
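As an illustration of objective (1), the following sketch (assuming the per-class root outputs are log-joints $\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_n, y{=}c)$ and the per-sample, per-RV normalization of Eq. (3)) computes the hybrid loss with NumPy:

    import numpy as np
    from scipy.special import logsumexp

    def hybrid_objective(log_joint, labels, lam, num_rvs):
        """Hybrid loss of Eq. (1) from per-class log-joints.

        log_joint: array of shape (N, C) holding log p(x_n, y=c); labels: (N,) integer array.
        """
        n = np.arange(len(labels))
        log_px = logsumexp(log_joint, axis=1)              # log p(x_n), log-sum-exp over classes
        ce = -np.mean(log_joint[n, labels] - log_px)       # cross-entropy, Eq. (2)
        nll = -np.mean(log_joint[n, labels]) / num_rvs     # normalized neg. log-likelihood, Eq. (3)
        return lam * ce + (1.0 - lam) * nll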

We implemented RAT-SPNs in Python/TensorFlow, where each region in our region graph is associated with a matrix with as many rows as the used batch size (kept fixed throughout). Each column represents one distribution in the region, i.e., there are I, S, and C columns in input regions, internal regions, and the root region, respectively. We perform all computations in the log-domain. As is well known, multiplying small probability values in the linear domain quickly approaches zero, making the computations prone to underflow. Therefore, we practically replace product nodes with additions and sum nodes with log-sum-exp operations, employing the frequently used “trick” of computing $\log \sum_i \exp(a_i)$ as $a^* + \log \sum_i \exp(a_i - a^*)$ with $a^* = \max_i a_i$. This function is readily provided in TensorFlow.
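For clarity, the “trick” referred to above is the standard log-sum-exp computation; a minimal NumPy version is sketched below for exposition (TensorFlow provides an equivalent reduce_logsumexp operation).

    import numpy as np

    def log_sum_exp(a, axis=-1):
        """log(sum(exp(a))) computed stably by subtracting the maximum first."""
        a_max = np.max(a, axis=axis, keepdims=True)
        return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))

    # A sum node in the log-domain: children's log-densities plus log-weights, then log-sum-exp.
    log_children = np.array([-1050.0, -1052.0, -1049.0])  # would underflow to 0 in the linear domain
    log_weights = np.log(np.array([0.2, 0.3, 0.5]))
    log_sum_node = log_sum_exp(log_children + log_weights)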

Implementing RAT-SPNs in TensorFlow allows us to optimize our objective using automatic differentiation and off-the-shelf gradient-based optimizers. Throughout our experiments, we used Adam [13] with its default settings. As input distributions, we used Gaussians with isotropic covariances, i.e., each input distribution reduces to a product of one-dimensional Gaussians with shared variances. We tried to optimize the variances jointly with the means, which, however, delivered worse results than merely setting all variances uniformly to a fixed value. We conjecture that Adam might not be well-suited to optimize variances, whereas optimization schemes like EM have no problem in this case [19]. While RAT-SPNs can be implemented and trained in a seamless way, they unfortunately yield hundreds of tensors, which is a sub-optimal layout in TensorFlow. This, together with performing computations in the log-domain, makes RAT-SPNs approximately an order of magnitude slower to train than ReLU-MLPs of similar size. Note, however, that this disadvantage is mostly caused by the current state of hardware and software development for deep learning; it is not of a fundamental nature.

3.3 Probabilistic Dropout

The size of RAT-SPNs can be easily controlled via the structural parameters D, R, S, and I. RAT-SPNs with many parameters, however, tend to overfit—just like regular neural networks—which requires regularization. One of the classical techniques that boosted deep learning models is Srivastava et al.’s dropout heuristic [26]. It sets inputs and/or hidden units to zero with a certain probability p and multiplies the remaining layer outputs with 1/(1−p). In the following we modify the dropout heuristic, proposing two variants for RAT-SPNs that exploit their probabilistic nature.

3.3.1 Dropout at Inputs: Marginalizing out Inputs

Dropout at inputs essentially marks input features as missing at random. In the probabilistic paradigm, we would simply marginalize over these missing features. Fortunately, this is an easy exercise in SPNs, as we only need to set the distributions corresponding to the dropped-out features to 1. As we operate in the log-domain, this means setting the corresponding log-distribution nodes to 0. This is in fact quite similar to standard dropout, except that we are not compensating with a factor 1/(1−p), and blocks of units are dropped out (i.e., all log-distributions whose scope corresponds to a missing input feature are jointly set to 0).
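In a tensorized implementation, this form of dropout reduces to multiplying the matrix of leaf log-densities by a feature-wise keep mask; a small illustrative NumPy sketch with hypothetical names (rng is a NumPy Generator, e.g. numpy.random.default_rng()):

    import numpy as np

    def input_dropout(leaf_log_probs, rv_of_column, keep_prob, rng):
        """Marginalize a random subset of features by zeroing their leaf log-densities.

        leaf_log_probs: (batch, num_leaf_columns) log-densities of the input distributions.
        rv_of_column:   (num_leaf_columns,) index of the RV each leaf column is defined on.
        """
        num_rvs = rv_of_column.max() + 1
        keep = rng.random((leaf_log_probs.shape[0], num_rvs)) < keep_prob
        # log 1 = 0: all leaves whose scope is a dropped RV contribute nothing.
        return leaf_log_probs * keep[:, rv_of_column]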

3.3.2 Dropout at Sums: Injection of Discrete Noise

As discussed in [22, 35, 19], sum nodes in SPNs can be interpreted as marginalized latent variables, akin to the latent variable interpretation in mixture models. In particular, [19] introduced so-called augmented SPNs which explicitly incorporate these latent variables in the SPN structure. The augmentation first introduces indicator nodes representing the states of the latent variables, which can switch the children of sum nodes on or off by connecting them via an additional product. This mechanism establishes the explicit interpretation of sum children as conditional distributions. In the case that completeness of the resulting SPN is impaired, additional sum nodes (twin sums) are introduced to complete the probabilistic model. See the discussion of Peharz et al. [19] for more details.

In RAT-SPNs, we can equally well interpret a whole region as a single latent variable, and the weights of each sum node in this region as the conditional distribution of this variable. Indeed, as is easily checked, the argumentation in [19] also holds when introducing a set of indicators for a single latent variable which is shared by all sum nodes in one region, as they all have the same scope and the same children. While the latent variables are not observed, we can employ a simple probabilistic version of dropout by introducing artificial observations for them. For example, if the sum nodes in a particular region have K children (i.e., the corresponding latent variable has K states), then we could introduce artificial information that the variable assumes a state in some random subset of these K states. By doing this for each latent variable in the network, we essentially select a small sub-structure of the whole SPN to explain the data—this argument is very similar to the original dropout proposal [26].

In any case, implementing dropout at sum layers is again straightforward: we select a random subset of the product nodes which are connected to the sums in one region and set them to 0 (i.e., −∞ in the log-domain). Again, we do not multiply with a correction factor.
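A minimal illustrative sketch of dropout at sum layers (hypothetical names; dropped products are set to probability 0, i.e., −∞ in the log-domain, without a correction factor):

    import numpy as np

    def sum_layer_dropout(product_log_probs, keep_prob, rng):
        """Set a random subset of product columns to probability 0 (-inf in the log-domain).

        product_log_probs: (batch, num_products) log-values of the products feeding one region's sums.
        No correction factor is applied; in practice at least one child should be kept per sum.
        """
        keep = rng.random(product_log_probs.shape) < keep_prob
        return np.where(keep, product_log_probs, -np.inf)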

4 Experiments

In the following we investigate the fitness of RAT-SPNs as deep learning models. Furthermore, we highlight their advantages when used as generative models.

4.1 Exploring the Capacity of RAT-SPNs

We start off by exploring the capacity of RAT-SPNs as function approximators for classification. A simple way to assess model capacity is to try to overfit the training data with various model sizes. To this end, we fit RAT-SPNs on the MNIST training data (yann.lecun.com/exdb/mnist), using every combination over a grid of split depths D, numbers of split repetitions R, and numbers of distributions per region. As we are in the data-agnostic setting, the natural baselines are MLPs, where we take ReLU activations for the hidden units and linear activations for the output layer. We ran MLPs with every combination over a grid of numbers of layers and numbers of hidden units. For both RAT-SPNs and MLPs, we used Adam with its default parameters to optimize cross-entropy (i.e., λ = 1 for RAT-SPNs).

Figure 2: Comparison of the capability of RAT-SPNs and MLPs (ReLU) to fit the MNIST training set. We show the training accuracy (y-axis, higher is better) after 200 epochs as a function of the number of parameters (x-axis; note the different scales). ‘Depth’ refers to the number of hidden layers in MLPs and to the split depth in RAT-SPNs. RAT-SPNs can fit the training set as well as MLPs and, in some cases, sooner than MLPs. (Best viewed in color.)

Figure 2 summarizes the training accuracy of both models after 200 epochs as a function of the number of parameters in the respective model. As one can see, RAT-SPNs scale to millions of parameters and are easily able to overfit the MNIST training set, to the same extent as MLPs. While for some numbers of layers it seems that RAT-SPNs are suited slightly better to fit the data, this is in fact only an artifact of SGD optimization: MLPs still jitter around during the last epochs, while the accuracy of RAT-SPNs remains stable.

These overfitting results indicate that RAT-SPNs are capacity-wise at least as powerful as ReLU-MLPs. In the next experiment, we investigated whether RAT-SPNs are also on par with MLPs concerning generalization on classification tasks. Subsequently, we investigated whether RAT-SPNs, due to their probabilistic nature, exhibit superior performance when dealing with missing features and identifying outliers reliably.

4.2 Generalization of RAT-SPNs

When trained without regularization, RAT-SPNs overfit, and their accuracy on the MNIST test set is rather inferior even for data-agnostic models. Therefore, we trained them with our probabilistic dropout variants introduced in Section 3.3. We cross-validated D, R, and the number of distributions per region, and applied a grid of dropout rates for inputs and for sum layers. In our convention, a dropout rate of p means that a fraction p of the features is actually kept.

For comparison, we trained ReLU-MLPs over grids of numbers of hidden layers, numbers of hidden units, input dropout rates, and dropout rates for hidden layers. No dropout was applied to the output layer. We trained MLPs in two variants, namely a ‘vanilla’ one (vMLP), meaning that besides dropout no additional optimization tricks were applied, and a variant (MLP) also employing Xavier initialization [10] and batch normalization [12]. While MLP should be considered the default way to train MLPs, one should note that helpful heuristics like Xavier initialization and batch normalization have evolved over decades, while similar techniques for RAT-SPNs are not yet available. Thus, vMLP might serve as the fairer comparison.

              RAT-SPN          MLP              vMLP

Accuracy
MNIST         98.19 (8.5M)     98.32 (2.64M)    98.09 (5.28M)
F-MNIST       89.52 (0.65M)    90.81 (9.28M)    89.81 (1.07M)
20-NG         47.8 (0.37M)     49.05 (0.31M)    48.81 (0.16M)

Cross-Entropy
MNIST         0.0852 (17M)     0.0874 (0.82M)   0.0974 (0.22M)
F-MNIST       0.3525 (0.65M)   0.2965 (0.82M)   0.325 (0.29M)
20-NG         1.6954 (1.63M)   1.6180 (0.22M)   1.6263 (0.22M)
Table 1: Classification results on MNIST, fashion-MNIST (F-MNIST) and 20 Newsgroups (20-NG) for RAT-SPNs, MLPs and ‘vanilla’ MLPs (vMLP). vMLPs are trained without Xavier initialization and batch normalization. The best test values for accuracy and cross-entropy are reported, together with the corresponding number of model parameters (in parentheses).

We trained on MNIST, fashion-MNIST (a dataset in the same format as MNIST, but with the task of classifying fashion items rather than digits; github.com/zalandoresearch/fashion-mnist) and 20 Newsgroups (20-NG; scikit-learn.org/stable/datasets/twenty_newsgroups.html). The 20-NG dataset is a text corpus of 18846 news documents belonging to 20 different newsgroups or classes. We first split the news documents into 13568 instances for training, 1508 for validation, and 3770 for testing. The text was pre-processed into a bag-of-words representation by keeping the top 1000 most relevant words according to their Tf-IDF. Then, 50 topics were extracted by LDA [2] and employed as the new feature representation for classification.

Table 1 summarizes the classification accuracy and cross-entropy on the test sets, as well as the size of the models in terms of the number of parameters. As one can see, RAT-SPNs are on par with MLPs, being only slightly outperformed on these traditional classification tasks. Note, however, that our approach to setting SPNs in a classical connectionist setting is rather simple; this, together with our capacity analysis in Section 4.1, indicates the potential of SPNs as prediction models. Moreover, as the following sections show, the real potential of probabilistic deep learning models actually lies beyond classical benchmark results.

4.3 Hybrid Post-Training

Recall that SPNs define a full distribution over both the inputs $\mathbf{X}$ and the class variable $Y$, and that our objective (1) with parameter λ allows us to trade off between cross-entropy (λ = 1) and log-likelihood (λ = 0). When λ = 1, we cannot hope that the distribution over the inputs is faithful to the underlying data. By setting λ < 1, however, we can obtain interesting hybrid models, yielding both a discriminative and a generative behavior. To this end, we use the RAT-SPN with the highest validation accuracy and post-train it for another 20 epochs, for various values of λ. This yields a natural trade-off between the log-likelihood over the inputs and predictive performance in terms of classification accuracy/cross-entropy. Figure 3 shows this trade-off. As one can see, by sacrificing little predictive performance, we can drastically improve the generative character of SPNs. The benefit of this is shown in the following.

Figure 3: A RAT-SPN is a joint model over both inputs and classes and thus allows the likelihood over the inputs to be evaluated. By varying λ we can control the trade-off between generative behavior (measured in log-likelihood) and discriminative behavior (measured in accuracy or cross-entropy).

4.4 SPNs Are Robust Against Missing Features

Figure 4: Robustness of differently (hybridly) trained RAT-SPNs and MLPs against features missing at random: classification accuracy (y-axis) as the fraction of missing features varies from 0.0 (all features available) to 0.99 (almost all missing).
Figure 5: Outliers (samples with low input log-likelihood) and inliers (samples with high input log-likelihood) on MNIST (top) and fashion-MNIST (bottom) for a hybridly post-trained RAT-SPN. Samples on the left half were classified correctly, samples on the right half incorrectly. The upper rows show outliers, the lower rows inliers, for MNIST and fashion-MNIST, respectively. The predictions for the wrongly classified MNIST digits are (depicted as correct→predicted): (top row) 4→2, 7→3, 5→3, 2→9, 4→7, 5→3, 4→6, 6→2, 9→3, 2→6; (bottom row) 4→9, 2→8, 4→2, 9→4, 2→7, 2→7, 2→8, 6→0, 4→9, 9→8.

When input features in $\mathbf{X}$ are missing at random, the probabilistic paradigm offers a clear solution: the marginalization of the missing features. As SPNs allow marginalization simply and efficiently, we expect RAT-SPNs to treat missing features robustly, especially the “more generative” they are (corresponding to smaller λ). To this end, we randomly discard a fraction of pixels in the MNIST test data—independently for each sample—and classify the data using RAT-SPNs trained with various values of λ, marginalizing the missing features. This is the same procedure we used for probabilistic dropout during training, cf. Section 3.3. Similarly, we might expect MLPs to perform robustly under missing features at test time when (classical) dropout is applied.

Figure 4 summarizes the classification results when the fraction of missing features is varied between 0.0 and 0.99. As one can see, RAT-SPNs with smaller λ are more stable against even large fractions of missing features. A particularly interesting choice is a small but nonzero λ: the corresponding RAT-SPN starts with high accuracy for no missing features and degrades very gracefully; for large fractions of missing features, its advantage over MLPs is dramatic. Note that this result is consistent with other hybrid learning schemes applied in graphical models [18]. Purely discriminative RAT-SPNs and MLPs are roughly on par concerning robustness against missing features.

4.5 SPNs Know What They Don't Know

Besides being robust against missing features, an important property of (hybrid) generative models is that they can in principle detect outliers and peculiarities by monitoring the likelihood over their inputs. To this end, we evaluated the input likelihoods on the test sets of both MNIST and fashion-MNIST under the respective hybridly post-trained RAT-SPN. We selected two thresholds by visual inspection of the histograms over input likelihoods, roughly corresponding to the tails of most unlikely and most likely samples. Within both of these sets, we selected—following the original order in MNIST—the first 10 samples which are classified correctly and incorrectly, respectively. We thus obtained 4 groups of 10 samples each: outlier/correct, outlier/incorrect, inlier/correct, inlier/incorrect.

These samples are shown in Figure 5. Albeit qualitative, these results are interesting: one can visually confirm that the outlier MNIST digits are indeed peculiar, both the correctly and the incorrectly classified ones. The outlier/incorrect group contains 2 samples (top row, right, 3rd and 8th) which are not recognizable to the authors either. The inlier/incorrect digits can be interpreted—with some care and a grain of salt—as the ambiguous ones, e.g., two ’2’s (bottom row, right, 5th and 6th) are similar to a ’7’ (and indeed classified as such), and one digit (bottom row, right, 8th) could either be a ’6’ or a ’0’. For fashion-MNIST, one can clearly see that the outliers are all low in contrast and fill the whole image. In one image (top row, right, 9th) the background has not been removed.

More objectively, we use Bradshaw et al.’s Transfer Testing (TT), a technique to assess the calibration of uncertainties in probabilistic models [3]. TT is quite simple: we feed a classifier trained on one domain (e.g. MNIST) with examples from a related but different domain (e.g. street view house numbers (SVHN) [17] or the handwritten digits of SEMEION [4]). While we would expect most classifiers to perform poorly in such a setting, an important property of an AI system is to be aware that it is confronted with out-of-domain data and to be able to communicate this either to other parts of the system or to a human user. While Bradshaw et al. applied TT to conditional models, i.e., to output uncertainties, a more natural approach is to apply it to input likelihoods, if available, as in SPNs.

Figure 6, top, shows histograms of the log-likelihoods of the hybridly post-trained RAT-SPN when fed with MNIST test data (in-domain), SVHN test data (out-of-domain) and SEMEION data (out-of-domain). The result is striking: the histograms show that the likelihood over the inputs provides a strong signal (note the log-scale on the y-axis) of whether a sample comes from in-domain or out-of-domain data. That is, RAT-SPNs have an additional communication channel—the likelihood over the inputs—to tell us whether they are confident in their predictions.

An MLP, as a non-probabilistic model, does not have such a means. As a sanity check, however, we mimic the same computation performed in RAT-SPNs to obtain a log-likelihood: adding log(1/C) to each output (assuming a uniform class prior) and computing the log-sum-exp of the result. One might suspect that this result, although not interpretable as a log-probability in MLPs, still yields a decent measure of confidence. In need of a name for this rather odd quantity, we call it mock-likelihood. Figure 6, bottom, shows histograms of this mock-likelihood: although the histograms are more spread out for out-of-domain data, they overlap heavily, yielding no clear signal for out-of-domain vs. in-domain.
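For reference, both quantities compared in Figure 6 can be computed from the C per-class outputs in the same way; a small illustrative sketch assuming a uniform class prior, as in the text (the decision threshold is hypothetical and would be chosen on in-domain validation data):

    import numpy as np
    from scipy.special import logsumexp

    def input_log_likelihood(per_class_outputs):
        """log p(x) under a uniform class prior: log-sum-exp over (output + log 1/C).

        Treating the per-class outputs as class-conditional log-densities log p(x|y) gives a
        proper log-likelihood for the RAT-SPN; applying the same formula to MLP outputs
        yields the "mock-likelihood".
        """
        C = per_class_outputs.shape[-1]
        return logsumexp(per_class_outputs - np.log(C), axis=-1)

    def is_out_of_domain(per_class_outputs, threshold):
        # Flag samples whose input log-likelihood falls below the chosen threshold.
        return input_log_likelihood(per_class_outputs) < threshold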

Taking all experimental results together, (RAT-)SPNs are powerful deep learning models. They are robust function approximators as well as probability estimators for arbitrary inputs and outputs, with fast and exact inference.

Figure 6: Histograms of test log-likelihoods on MNIST, SVHN and SEMEION data for the RAT-SPN (top) and of the corresponding computation performed for the MLP (“mock-likelihood”, bottom). Both models were trained on MNIST. The likelihood of the RAT-SPN yields a strong signal of whether a sample is in-domain or out-of-domain.

5 Conclusion

We introduced a particularly simple but effective way to train SPNs: simply pick a random structure and train it in an end-to-end fashion like a neural network. This makes the application of SPNs within the deep learning framework seamless and allows the use of common deep learning tools such as automatic differentiation and easy use of GPUs. As a modest technical contribution, we adapted the well-known dropout heuristic and equipped it with a sound probabilistic interpretation within RAT-SPNs. RAT-SPNs showed performance on par with traditional neural networks on several classification tasks. Moreover, RAT-SPNs demonstrate their full power when used as generative models, showing remarkable robustness against missing features through exact and efficient inference and compelling results in anomaly/out-of-domain detection. In future work, the hybrid properties of RAT-SPNs could enable promising directions such as new variants of semi-supervised or active learning. While this paper stays in the data-agnostic regime, in future work we will investigate SPNs tailored to structured data sources.

References

  • [1] T. Adel, D. Balduzzi, and A. Ghodsi. Learning the structure of sum-product networks via an SVD-based algorithm. In UAI, 2015.
  • [2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 2003.
  • [3] J. Bradshaw, A. Matthews, and Z. Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks. preprint arXiv, 2017. arxiv.org/abs/1707.02476.
  • [4] M. Buscema. MetaNet*: The Theory of Independent Judges, volume 33. 02 1998.
  • [5] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.
  • [6] A. Dennis and D. Ventura. Learning the architecture of sum-product networks using clustering on variables. In Proceedings of NIPS, 2012.
  • [7] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [8] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Proceedings of NIPS, pages 3248–3256, 2012.
  • [9] R. Gens and P. Domingos. Learning the structure of sum-product networks. Proceedings of ICML, pages 873–880, 2013.
  • [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS, pages 249–256, 2010.
  • [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of NIPS, pages 2672–2680, 2014.
  • [12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML, 2015.
  • [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2015.
  • [14] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014. arXiv:1312.6114.
  • [15] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In Proceedings of AISTATS, pages 29–37, 2011.
  • [16] A. Molina, A. Vergari, N. Di Mauro, S. Natarajan, F. Esposito, and K. Kersting. Mixed sum-product networks: A deep architecture for hybrid domains. In Proceedings of AAAI, 2018.
  • [17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • [18] R. Peharz, B. Geiger, and F. Pernkopf. Greedy part-wise learning of sum-product networks. In Proceedings of ECML/PKDD, pages 612–627. Springer Berlin, 2013.
  • [19] R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • [20] R. Peharz, G. Kapeller, P. Mowlaee, and F. Pernkopf. Modeling speech with sum-product networks: Application to bandwidth extension. In Proceedings of ICASSP, pages 3699–3703, 2014.
  • [21] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In Proceedings of AISTATS, pages 744–752, 2015.
  • [22] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of UAI, pages 337–346, 2011.
  • [23] A. Rashwan, H. Zhao, and P. Poupart. Online and distributed bayesian moment matching for parameter learning in sum-product networks. In AISTATS, pages 1469–1477, 2016.
  • [24] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of ICML, pages 1278–1286, 2014.
  • [25] A. Rooshenas and D. Lowd. Learning Sum-Product Networks with Direct and Indirect Variable Interactions. ICML – JMLR W&CP, 32:710–718, 2014.
  • [26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
  • [27] M. Trapp, T. Madl, R. Peharz, F. Pernkopf, and R. Trappl. Safe semi-supervised learning of sum-product networks. In Proceedings of UAI, 2017.
  • [28] M. Trapp, R. Peharz, M. Skowron, T. Madl, F. Pernkopf, and R. Trappl. Structure inference in sum-product networks using infinite sum-product trees. In NIPS Workshop on Practical Bayesian Nonparametrics, 2016.
  • [29] B. Uria, I. Murray, and H. Larochelle. A deep and tractable density estimator. In Proceedings of ICML, pages 467–475, 2014.
  • [30] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of ICML, 2016.
  • [31] A. Vergari, N. Di Mauro, and F. Esposito. Simplifying, regularizing and strengthening sum-product network structure learning. In Proceedings of ECML/PKDD, pages 343–358. Springer, 2015.
  • [32] A. Vergari, N. Di Mauro, and F. Esposito. Visualizing and understanding sum-product networks. preprint arXiv, 2016.
  • [33] A. Vergari, R. Peharz, N. Di Mauro, A. Molina, K. Kersting, and F. Esposito. Sum-product autoencoding: Encoding and decoding representations using sum-product networks. In AAAI, 2018.
  • [34] H. Zhao, T. Adel, G. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In Proceedings of ICML, 2016.
  • [35] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In Proceedings of ICML, 2015.
  • [36] H. Zhao, P. Poupart, and G. J. Gordon. A unified approach for learning the parameters of sum-product networks. In Proceedings of NIPS, 2016.