# Parsimonious Bayesian deep networks

Mingyuan Zhou
The University of Texas at Austin, Austin, TX 78712
mingyuan.zhou@mccombs.utexas.edu
###### Abstract

Combining Bayesian nonparametrics and a forward model selection strategy, we construct parsimonious Bayesian deep networks (PBDNs) that infer capacity-regularized network architectures from the data and require neither cross-validation nor fine-tuning when training the model. One of the two essential components of a PBDN is the development of a special infinite-wide single-hidden-layer neural network, whose number of active hidden units can be inferred from the data. The other one is the construction of a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic gradient descent based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.

## 1 Introduction

To separate two linearly separable classes, a simple linear classifier such as logistic regression will often suffice, in which scenario adding the capability to model nonlinearity not only complicates the model and increases computation, but also often harms rather than improves the performance by increasing the risk of overfitting. On the other hand, for two classes not well separated by a single hyperplane, a linear classifier is often inadequate, and hence it is common to use either kernel support vector machines (Vapnik, 1998; Schölkopf et al., 1999) or deep neural networks (Hinton et al., 2006; LeCun et al., 2015; Goodfellow et al., 2016) to nonlinearly transform the covariates, making the two classes become more linearly separable in the transformed covariate space. While being able to achieve high classification accuracy, they both have clear limitations. For a kernel based classifier, its number of support vectors often increases linearly in the size of training data (Steinwart, 2003), making it not only computationally expensive and memory inefficient to train for big data, but also slow in out-of-sample predictions. A deep neural network could be scalable with an appropriate network structure, but it is often cumbersome to tune the network depth (number of layers) and width (number of hidden units) of each hidden layer (Goodfellow et al., 2016), and the network, which is often equipped with a larger than necessary modeling capacity, risks overfitting the training data if it is not carefully regularized.

Rather than making an uneasy choice in the first place between a linear classifier, which has fast computation and resists overfitting but may not provide sufficient class separation, and an over-capacitized model, which often wastes computation and requires careful regularization to prevent overfitting, we propose a parsimonious Bayesian deep network (PBDN) that builds its capacity regularization into the greedy-layer-wise construction and training of the deep network. More specifically, we transform the covariates in a layer-wise manner, with each layer of transformation designed to facilitate class separation via the use of the noisy-OR interactions of multiple weighted linear hyperplanes. Related to kernel support vector machines, the hyperplanes play a similar role as support vectors in transforming the covariate space, but they are inferred from the data and their number increases at a much slower rate with the training data size. Related to deep neural networks, the proposed multi-layer structure gradually increases its modeling capability by increasing its number of layers, but allows inferring from the data both the width of each hidden layer and depth of the network to prevent building a model whose capacity is larger than necessary.

For capacity regularization, we choose to shrink both the width and depth of the proposed PBDN. To shrink the width of a hidden layer, we propose the use of a gamma process (Ferguson, 1973), a draw from which consists of countably infinite atoms, each of which is used to represent a hyperplane in the covariate space. The gamma process has an inherent shrinkage mechanism, as the number of its atoms whose random weights are larger than a certain positive constant $\epsilon$ follows a Poisson distribution, whose mean is finite almost surely (a.s.) and reduces towards zero as $\epsilon$ increases. To shrink the depth of the network, we propose a layer-wise greedy-learning strategy that increases the depth by adding one hidden layer at a time, and uses an appropriate model selection criterion to decide when to stop adding another one. Our experiments show the proposed capacity regularization strategy helps successfully build a PBDN, providing state-of-the-art classification accuracy while maintaining low computational complexity for out-of-sample prediction. We have also tried applying a highly optimized off-the-shelf deep neural network based classifier, whose network architecture for a given dataset is set to be the same as that inferred by the PBDN. However, we have found no performance gains, suggesting the efficacy of the PBDN’s greedy training procedure that requires neither cross-validation nor fine-tuning.

## 2 Layer-width learning via infinite support hyperplane machines

The first essential component of the proposed capacity regularization strategy is to learn the width of a hidden layer. To fulfill this goal, we define the infinite support hyperplane machine (iSHM), a label-asymmetric classifier that places in the covariate space countably infinite hyperplanes $\{\beta_k\}_{k=1}^{\infty}$, where each $\beta_k$ is associated with a weight $r_k>0$. We use a gamma process (Ferguson, 1973) to generate $\{r_k,\beta_k\}_{k=1}^{\infty}$, making the infinite sum $\sum_{k=1}^{\infty}r_k$ finite almost surely (a.s.). We measure the proximity of a covariate vector $x_i$ to $\beta_k$ using the softplus function of their inner product as $\ln(1+e^{\beta_k'x_i})$, which is a smoothed version of $\mathrm{ReLU}(\beta_k'x_i)=\max(0,\beta_k'x_i)$ that is widely used in deep neural networks (Nair and Hinton, 2010; Glorot et al., 2011; Krizhevsky et al., 2012; Shang et al., 2016). We consider that $x_i$ is far from hyperplane $k$ if $\ln(1+e^{\beta_k'x_i})$ is close to zero. Thus, as $x_i$ moves away from hyperplane $k$, that proximity measure monotonically increases on one side of the hyperplane while decreasing on the other. We then pass $\lambda_i=\sum_{k=1}^{\infty}r_k\ln(1+e^{\beta_k'x_i})$, a non-negative weighted combination of these proximity measures, through the Bernoulli-Poisson link (Zhou, 2015) to define the conditional class probability as

$$P(y_i=1\mid\{r_k,\beta_k\}_k,x_i)=1-\prod_{k=1}^{\infty}(1-p_{ik}),\qquad p_{ik}=1-e^{-r_k\ln(1+e^{\beta_k'x_i})}. \tag{1}$$

Note the model treats the data labeled as “0” and “1” differently, and it is evident from (1) that in general . We will show that

$$\sum_i p_{ik}x_i\Big/\sum_i p_{ik} \tag{2}$$

can be used to represent the $k$th data subtype discovered by the algorithm.
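As a minimal NumPy sketch of (1) and (2), assuming a finite truncation to $K$ hyperplanes and covariate vectors that carry a leading 1 for the bias term, the per-hyperplane probabilities, the noisy-OR class probability, and the data subtypes could be computed as follows. The function names, array layout, and toy values are illustrative assumptions, not the author's code.

```python
import numpy as np

def softplus(z):
    # Numerically stable ln(1 + e^z).
    return np.logaddexp(0.0, z)

def ishm_probabilities(X, B, r):
    """X: (N, V+1) covariates with a leading 1 (assumed layout).
    B: (K, V+1) hyperplanes beta_k; r: (K,) positive weights r_k."""
    rates = r * softplus(X @ B.T)            # r_k * ln(1 + e^{beta_k' x_i}), shape (N, K)
    p_ik = -np.expm1(-rates)                 # p_ik = 1 - exp(-rates), as in (1)
    p_y1 = -np.expm1(-rates.sum(axis=1))     # 1 - prod_k (1 - p_ik)
    return p_ik, p_y1

def subtypes(X, p_ik):
    # Eq. (2): average of the x_i weighted by p_ik, one row per hyperplane.
    return (p_ik.T @ X) / p_ik.sum(axis=0)[:, None]

# Toy inputs (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
B = rng.normal(size=(3, 3))
r = np.array([1.0, 0.5, 1e-6])               # the third hyperplane is nearly inactive
p_ik, p_y1 = ishm_probabilities(X, B, r)
S = subtypes(X, p_ik)
```

Working with the summed rates rather than the product of $(1-p_{ik})$ avoids multiplying many numbers close to one, which is why the sketch uses `expm1`.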

One may readily notice from (1) that the noisy-OR construction, widely used in probabilistic reasoning (Pearl, 1988; Jordan et al., 1999; Arora et al., 2016), is generalized by iSHM to attribute a binary outcome of to countably infinite hidden causes . Denoting as the logical OR operator, as if and if , we have an augmented form of (1) as

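$$y_i=\bigvee_{k=1}^{\infty}b_{ik},\quad b_{ik}\sim\mathrm{Bernoulli}(p_{ik}),\quad p_{ik}=1-e^{-\theta_{ik}},\quad \theta_{ik}\sim\mathrm{Gamma}(r_k,e^{\beta_k'x_i}), \tag{3}$$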

where can be further augmented as where , , and equals to 1 if the condition is satisfied and 0 otherwise.

We now marginalize out to formally define iSHM. Let denote a gamma process defined on the product space , where , , and is a finite and continuous base measure over a complete separable metric space . As illustrated in Fig. 2 (b), given a draw from , expressed as , where is an atom and is its weight, the infinite support hyperplane machine (iSHM) generates the label under the Bernoulli-Poisson link (Zhou, 2015) as

$$y_i\mid G,x_i\sim\mathrm{Bernoulli}\Big(1-e^{-\sum_{k=1}^{\infty}r_k\ln(1+e^{x_i'\beta_k})}\Big), \tag{4}$$

which can be represented as a noisy-OR model as in (3) or, as shown in Fig. 2 (a), constructed as

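$$y_i=\delta(m_i\ge 1),\quad m_i=\sum_{k=1}^{\infty}m_{ik},\quad m_{ik}\sim\mathrm{Pois}(\theta_{ik}),\quad \theta_{ik}\sim\mathrm{Gamma}(r_k,e^{\beta_k'x_i}). \tag{5}$$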

From (3) and (5), it is clear that one may declare hyperplane as inactive if .
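The equivalence between the augmented construction (5) and the marginal form (4) can be checked by simulation. The sketch below assumes that the second argument of the gamma distribution in (5) is a scale parameter, under which $\mathbb{E}[e^{-\theta_{ik}}]=(1+e^{\beta_k'x_i})^{-r_k}$, and it uses two hypothetical hyperplanes with made-up values.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 0.3, -1.2])            # one covariate vector, leading 1 assumed
B = np.array([[0.2, 1.0, -0.5],
              [-1.0, 0.4, 0.8]])          # two hypothetical hyperplanes beta_k
r = np.array([0.7, 1.5])                  # their weights r_k

n_draws = 200_000
scale = np.exp(B @ x)                     # e^{beta_k' x}, read as a gamma scale
theta = rng.gamma(shape=r, scale=scale, size=(n_draws, 2))
m = rng.poisson(theta)                    # m_ik ~ Pois(theta_ik)
y = m.sum(axis=1) >= 1                    # y_i = delta(m_i >= 1)

mc_estimate = y.mean()                    # Monte Carlo estimate of P(y = 1)
closed_form = -np.expm1(-(r * np.logaddexp(0.0, B @ x)).sum())  # Eq. (4)
print(mc_estimate, closed_form)
```

The two numbers should agree up to Monte Carlo error, which supports reading (3) and (5) as augmentations of the same marginal model.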

### 2.1 Inductive bias and distinction from MLP

Below we reveal the inductive bias of iSHM in prioritizing the fit of the data labeled as “1,” due to the use of the Bernoulli-Poisson link function that has previously been applied for social network analysis (Zhou, 2015; Caron and Fox, 2015; Zhou, 2018). Since the negative log-likelihood (NLL) for can be expressed as we have if and if . As quickly explodes towards as , when , iSHM would adjust and to avoid at all cost overly suppressing (, making too small). By contrast, it has a high tolerance of failing to sufficiently suppress with . Thus each with would be made sufficiently close to at least one active support hyperplane. By contrast, while each with is desired to be far away from any support hyperplanes, violating that is typically not strongly penalized. Therefore, by training a pair of iSHMs under two opposite labeling settings, two sets of support hyperplanes could be inferred to sufficiently cover the covariate space occupied by the training data from both classes.
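The asymmetry described above can be seen numerically. Writing $\lambda_i$ for the exponent in (4), the Bernoulli-Poisson link gives $P(y_i=1)=1-e^{-\lambda_i}$ and $P(y_i=0)=e^{-\lambda_i}$, so the per-example NLL is $-\ln(1-e^{-\lambda_i})$ for a positive label and $\lambda_i$ for a negative one. The following sketch, with illustrative values of $\lambda_i$ only, tabulates both.

```python
import numpy as np

def nll_bernoulli_poisson(y, lam):
    # y = 1: -ln(1 - e^{-lam});  y = 0: lam.
    return np.where(y == 1, -np.log(-np.expm1(-lam)), lam)

lam = np.array([1e-4, 1e-2, 1.0, 5.0])    # illustrative values of lambda_i
nll_pos = nll_bernoulli_poisson(1, lam)   # grows without bound as lam -> 0
nll_neg = nll_bernoulli_poisson(0, lam)   # grows only linearly in lam
for l, a, b in zip(lam, nll_pos, nll_neg):
    print(f"lambda={l:g}  NLL(y=1)={a:.4f}  NLL(y=0)={b:.4f}")
```

A positive example with $\lambda_i$ near zero is penalized very heavily, whereas a negative example with a moderately large $\lambda_i$ pays only a linear cost.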

Note as in (4), iSHM may be viewed as an infinite-wide single-hidden-layer neural network that connects the input layer to the $k$th hidden unit via the connection weights $\beta_k$ and the softplus nonlinear activation function $\ln(1+e^{x_i'\beta_k})$, and further passes a non-negative weighted combination of these hidden units through the Bernoulli-Poisson link to obtain the conditional class probability. From this point of view, it can be related to a single-hidden-layer multilayer perceptron (MLP) (Bishop, 1995; Goodfellow et al., 2016) that uses a softplus activation function and cross-entropy loss, with the output activation expressed as , where , is the number of hidden units, , and . Note minimizing the cross-entropy loss is equivalent to maximizing the likelihood of which is biased towards fitting neither the data with nor those with , since Therefore, while iSHM is structurally similar to an MLP, it is distinct in its unbounded layer width, its positive constraint on the weights connecting the hidden and output layers, its ability to rigorously define whether a hyperplane is active or inactive, and its inductive bias towards fitting the data labeled as “1.” As in practice labeling which class as “1” may be arbitrary, we predict the class label with where and are from a pair of iSHMs trained by labeling the data belonging to this class as “1” and “0,” respectively.

### 2.2 Convex polytope geometric constraint

It is straightforward to show that iSHM with a single unit-weighted hyperplane reduces to logistic regression, since $1-e^{-\ln(1+e^{x_i'\beta})}=1/(1+e^{-x_i'\beta})$. To interpret the role of each individual support hyperplane when multiple non-negligibly weighted ones are inferred by iSHM, we analogize each $\beta_k$ to an expert of a committee that collectively makes binary decisions. For expert (hyperplane) $k$, the weight $r_k$ indicates how strongly its opinion is weighted by the committee, $b_{ik}=0$ represents that it votes “No,” and $b_{ik}=1$ represents that it votes “Yes.” Since $y_i=\bigvee_{k=1}^{\infty}b_{ik}$, the committee would vote “No” if and only if all its experts vote “No” (i.e., all $b_{ik}$ are zeros); in other words, the committee would vote “Yes” even if only a single expert votes “Yes.” Let us now examine the confined covariate space that satisfies the inequality $P(y_i=1\mid\{r_k,\beta_k\}_k,x_i)\le p_0$, where a data point is labeled as “1” with a probability no greater than $p_0$. The following theorem shows that it defines a confined space bounded by a convex polytope, as defined by the intersection of the countably infinite half-spaces given in (7).

###### Theorem 1 (Convex polytope).

For iSHM, the confined space specified by the inequality

$$P(y_i=1\mid\{r_k,\beta_k\}_k,x_i)\le p_0 \tag{6}$$

is bounded by a convex polytope defined by the set of solutions to countably infinite inequalities as

$$x_i'\beta_k\le\ln\Big[(1-p_0)^{-\frac{1}{r_k}}-1\Big],\qquad k\in\{1,2,\ldots\}. \tag{7}$$

The convex polytope defined in (7) is enclosed by the intersection of countably infinite half-spaces. If we set $p_0$ as the probability threshold to make binary decisions, then the convex polytope assigns a label of $y_i=0$ to an $x_i$ inside the convex polytope (i.e., an $x_i$ that satisfies all the inequalities in Eq. 7) with a relatively high probability, and assigns a label of $y_i=1$ to an $x_i$ outside the convex polytope (i.e., an $x_i$ that violates at least one of the inequalities in Eq. 7) with a probability of at least $p_0$. Note that hyperplane $k$ with $r_k\to 0$ has a negligible impact on the conditional class probability. Choosing the gamma process as the nonparametric Bayesian prior sidesteps the need to tune the number of experts. It shrinks the weights of all unnecessary experts, allowing a finite number of non-negligibly weighted ones (support hyperplanes) to be inferred automatically from the data. We provide in Appendix B the connections to previously proposed multi-hyperplane models (Aiolli and Sperduti, 2005; Wang et al., 2011; Manwani and Sastry, 2010, 2011; Kantchelian et al., 2014).
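Theorem 1 can be illustrated numerically on a truncated model. The sketch below, which uses three hypothetical hyperplanes in two dimensions and random test points, computes the right-hand sides of (7) and checks two consequences: every point with $P(y_i=1)\le p_0$ satisfies all the inequalities, and every point that violates at least one inequality has $P(y_i=1)>p_0$. Satisfying all the inequalities is necessary but not sufficient for $P(y_i=1)\le p_0$.

```python
import numpy as np

def polytope_bounds(r, p0):
    # Right-hand side of (7): ln[(1 - p0)^(-1/r_k) - 1], one value per hyperplane.
    return np.log(np.power(1.0 - p0, -1.0 / r) - 1.0)

def prob_y1(X, B, r):
    # Eq. (1) evaluated for every row of X.
    return -np.expm1(-(r * np.logaddexp(0.0, X @ B.T)).sum(axis=1))

# Hypothetical hyperplanes (first entry is the bias) and weights.
B = np.array([[-3.0, 1.0, 0.0],
              [-3.0, 0.0, 1.0],
              [-3.0, -1.0, -1.0]])
r = np.array([1.0, 2.0, 0.5])
p0 = 0.5

rng = np.random.default_rng(2)
X = np.hstack([np.ones((1000, 1)), rng.normal(scale=2.0, size=(1000, 2))])

bounds = polytope_bounds(r, p0)
inside = (X @ B.T <= bounds).all(axis=1)   # satisfies every inequality in (7)
probs = prob_y1(X, B, r)

print("points with P <= p0:", int((probs <= p0).sum()))
print("points outside the polytope:", int((~inside).sum()))
```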

### 2.3 Gibbs sampling and MAP inference via SGD

For the convenience of implementation, we truncate the gamma process with a finite and discrete base measure as , where will be set sufficiently large to approximate the truly countably infinite model. We express iSHM using (5) together with

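$$r_k\sim\mathrm{Gamma}(\gamma_0/K,1/c_0),\quad \gamma_0\sim\mathrm{Gamma}(a_0,1/b_0),\quad c_0\sim\mathrm{Gamma}(e_0,1/f_0),$$

$$\beta_k\sim\prod_{v=0}^{V}\int\mathcal{N}(0,\alpha_{vk}^{-1})\,\mathrm{Gamma}(\alpha_{vk};a_\beta,1/b_{\beta k})\,d\alpha_{vk},\quad b_{\beta k}\sim\mathrm{Gamma}(e_0,1/f_0).$$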

Related to Tipping (2001), the normal gamma construction promotes sparsity on the connection weights $\beta_k$.
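The shrinkage induced by the truncated prior $r_k\sim\mathrm{Gamma}(\gamma_0/K,1/c_0)$ can be seen by simulation. In the sketch below, $\gamma_0$ and $c_0$ are fixed to hypothetical values rather than sampled from their hyperpriors, and the threshold `eps` is an illustrative choice. Because the $r_k$ share a scale, their sum is distributed as $\mathrm{Gamma}(\gamma_0,1/c_0)$ for every $K$, so enlarging $K$ spreads the same total mass over more atoms and leaves only a few weights above the threshold.

```python
import numpy as np

def sample_truncated_weights(K, gamma0, c0, rng):
    # r_k ~ Gamma(shape = gamma0 / K, scale = 1 / c0) for k = 1, ..., K.
    return rng.gamma(shape=gamma0 / K, scale=1.0 / c0, size=K)

rng = np.random.default_rng(3)
gamma0, c0, eps = 1.0, 1.0, 0.01          # hypothetical values, for illustration
active_counts = {}
for K in (10, 100, 1000):
    r = sample_truncated_weights(K, gamma0, c0, rng)
    active_counts[K] = int((r > eps).sum())
    print(f"K={K:5d}  sum of weights={r.sum():.3f}  weights above eps={active_counts[K]}")
```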

We describe both Gibbs sampling, desirable for uncertainty quantification, and maximum a posteriori (MAP) inference, suitable for large-scale training, in Algorithm 1. We use data augmentation and marginalization to derive Gibbs sampling, with the details deferred to Appendix B. For MAP inference, we use Adam (Kingma and Ba, 2014) in Tensorflow to minimize a stochastic objective function as which embeds the hierarchical Bayesian model’s inductive bias and inherent shrinking mechanism into optimization, where is the size of a randomly selected mini-batch, , , and

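$$f\big(\{\beta_k,\ln r_k\}_{1}^{K},\{y_i,x_i\}_{i_1}^{i_M}\big)=\sum_{k=1}^{K}\Big(-\frac{\gamma_0}{K}\ln r_k+c_0e^{\ln r_k}\Big)+(a_\beta+1/2)\sum_{v=0}^{V}\sum_{k=0}^{K}\Big[\ln\big(1+\beta_{vk}^2/(2b_{\beta k})\big)\Big]+\frac{N}{M}\sum_{i=i_1}^{i_M}\Big[-y_i\ln\big(1-e^{-\lambda_i}\big)+(1-y_i)\lambda_i\Big]. \tag{8}$$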
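To make the three terms of (8) concrete, the following NumPy sketch evaluates the objective on one mini-batch. It is not the TensorFlow/Adam implementation used in the paper: it only computes the value, it assumes $\lambda_i=\sum_{k}r_k\ln(1+e^{\beta_k'x_i})$ as in (4), it sums the sparsity term over the $K$ truncated hyperplanes, and all argument names and hyperparameter values are placeholders.

```python
import numpy as np

def map_objective(B, log_r, X_batch, y_batch, N, gamma0, c0, a_beta, b_beta):
    """B: (K, V+1) hyperplanes; log_r: (K,) values of ln r_k; b_beta: (K,).
    X_batch: (M, V+1); y_batch: (M,) in {0, 1}; N: training set size."""
    K, M = B.shape[0], X_batch.shape[0]
    r = np.exp(log_r)
    # Gamma prior on r_k, expressed in terms of ln r_k.
    prior_r = np.sum(-(gamma0 / K) * log_r + c0 * r)
    # Sparsity-promoting term on the connection weights.
    prior_beta = (a_beta + 0.5) * np.sum(np.log1p(B**2 / (2.0 * b_beta[:, None])))
    # Bernoulli-Poisson negative log-likelihood, rescaled from batch to full data.
    lam = (r * np.logaddexp(0.0, X_batch @ B.T)).sum(axis=1)
    nll = -y_batch * np.log(-np.expm1(-lam)) + (1 - y_batch) * lam
    return prior_r + prior_beta + (N / M) * nll.sum()

# Tiny worked case: K = 2 hyperplanes, all-zero weights, r_k = 1.
X_batch = np.array([[1.0, 0.5], [1.0, -0.5]])
y_batch = np.array([1, 0])
value = map_objective(B=np.zeros((2, 2)), log_r=np.zeros(2), X_batch=X_batch,
                      y_batch=y_batch, N=2, gamma0=1.0, c0=1.0,
                      a_beta=1.0, b_beta=np.ones(2))
print(value)
```

In this worked case $\lambda_i=2\ln 2$ for both examples, so the value equals $2c_0+\ln(4/3)+\ln 4$, which gives a quick sanity check before handing the objective to a gradient-based optimizer.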

## 3 Network-depth learning via forward model selection

The second essential component of the proposed capacity regularization strategy is to find a way to increase the network depth and determine how deep is deep enough. Our solution is to sequentially stack a pair of iSHMs on top of the previously trained one and develop a forward model selection criterion to decide when to stop stacking another pair. We refer to the resulting model as a parsimonious Bayesian deep network (PBDN), as described below in detail.

The noisy-OR hyperplane interactions allow iSHM to go beyond simple linear separation, but with limited capacity due to the convex-polytope constraint imposed on the decision boundary. On the other hand, it is the convex-polytope constraint that provides an implicit regularization, determining how many non-negligibly weighted support hyperplanes are necessary in the covariate space to sufficiently activate all data of class “1,” while somewhat suppressing the data of class “0.” In this paper, we discover that the model capacity could be quickly enhanced by sequentially stacking such convex-polytope constraints under a feedforward deep structure, while preserving the virtue of being able to learn the number of support hyperplanes in the (transformed) covariate space.

More specifically, as shown in Fig. 2 (c) of the Appendix, we first train a pair of iSHMs that regress the current labels and the flipped ones , respectively, on the original covariates . After obtaining support hyperplanes , constituted by the active support hyperplanes inferred by both iSHM trained with and the one trained with , we use as the hidden units of the second layer (first hidden layer). More precisely, denoting as the input data vector to layer , where , , , is an empty set, and , the th added pair of iSHMs transform into the hidden units of layer , expressed as

$$\tilde{x}_i^{(t+1)}=\Big[\ln\big(1+e^{(x_i^{(t)})'\beta_1^{(t\to t+1)}}\big),\ldots,\ln\big(1+e^{(x_i^{(t)})'\beta_{K_{t+1}}^{(t\to t+1)}}\big)\Big]'.$$

Hence the input vectors used to train the next layer would be . Therefore, if the computational cost of a single inner product (, logistic regression) is one, then that for hidden layers would be about Note one may also use , or , or other related concatenation methods to construct the covariates to train the next layer.
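The layer-wise transformation can be sketched as below. The hyperplanes from the two iSHMs of a pair are pooled, each input is mapped to its softplus activations, and a constant 1 is prepended so that the next layer again has a bias term. Prepending the 1 is an assumption made here, chosen to match the $(K_t+1)K_{t+1}$ weight count that appears in (9); the function and variable names are hypothetical.

```python
import numpy as np

def transform_layer(X_t, B):
    """X_t: (N, K_t + 1) inputs to layer t, first column all ones (assumed).
    B: (K_next, K_t + 1) active hyperplanes pooled from both iSHMs of the pair."""
    hidden = np.logaddexp(0.0, X_t @ B.T)    # softplus hidden units, shape (N, K_next)
    ones = np.ones((X_t.shape[0], 1))
    return np.hstack([ones, hidden])         # assumed: bias column for the next layer

rng = np.random.default_rng(4)
X1 = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
B_pos = rng.normal(size=(3, 3))   # hyperplanes of the iSHM trained on the labels
B_neg = rng.normal(size=(2, 3))   # hyperplanes of the iSHM trained on flipped labels
X2 = transform_layer(X1, np.vstack([B_pos, B_neg]))
print(X2.shape)
```

Concatenating the output with earlier-layer covariates, as the text mentions, would only require stacking additional columns onto `hidden` before the next pair is trained.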

Our intuition for why a PBDN, constructed in this greedy-layer-wise manner, works well is that for two iSHMs trained on the same covariate space under two opposite labeling settings, one iSHM places enough hyperplanes to define the complement of a convex polytope to sufficiently activate all data labeled as “1,” while the other does so for all data labeled as “0.” Thus, for any , at least one would be sufficiently activated, in other words, would be sufficiently close to at least one of the active hyperplanes of the iSHM pair. This mechanism prevents an from being completely suppressed after transformation. Consequently, these transformed covariates , which can also be concatenated with , will be further used to train another iSHM pair. Thus even though a single iSHM pair may not be powerful enough, by keeping all covariate vectors sufficiently activated after transformation, they could be simply stacked sequentially to gradually enhance the model capacity, with a strong resistance to overfitting and hence without the need of cross-validation.

While stacking an additional iSHM pair on the PBDN could enhance the model capacity, when hidden layers is sufficiently deeper that the two classes become well separated, there is no more need to add an extra iSHM pair. To detect when it is appropriate to stop adding another iSHM pair, as shown in Algorithm 2 of the Appendix, we consider a forward model selection strategy that sequentially stacks an iSHM pair after another, until the following criterion starts to rise:

$$\mathrm{AIC}(T)=\sum_{t=1}^{T}\big[2(K_t+1)K_{t+1}\big]+2K_{T+1}-2\sum_i\Big[\ln P\big(y_i\mid x_i^{(T)}\big)+\ln P\big(y_i^*\mid x_i^{(T)}\big)\Big], \tag{9}$$

where $2(K_t+1)K_{t+1}$ represents the cost of adding the $t$th hidden layer and $2K_{T+1}$ represents the cost of using $K_{T+1}$ nonnegative weights to connect the $T$th hidden layer and the output layer. With (9), we choose a PBDN with $T$ hidden layers if $\mathrm{AIC}(t+1)\le\mathrm{AIC}(t)$ for $t=1,\ldots,T-1$ and $\mathrm{AIC}(T+1)>\mathrm{AIC}(T)$. We also consider another model selection criterion that accounts for the sparsity of $\beta_k^{(t\to t+1)}$, the connection weights between adjacent layers, using

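$$\mathrm{AIC}_\epsilon(T)=\sum_{t=1}^{T}2\Big(\big\||B_t|>\epsilon\beta_{t\max}\big\|_0+\big\||B_t^*|>\epsilon\beta^*_{t\max}\big\|_0\Big)+2K_{T+1}-2\sum_i\Big[\ln P\big(y_i\mid x_i^{(T)}\big)+\ln P\big(y_i^*\mid x_i^{(T)}\big)\Big], \tag{10}$$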

where $\|B\|_0$ is the number of nonzero elements in matrix $B$, $\epsilon>0$ is a small constant, and $B_t$ and $B_t^*$ consist of the $\beta_k^{(t\to t+1)}$ trained by the first and second iSHMs of the $t$th iSHM pair, respectively, with $\beta_{t\max}$ and $\beta^*_{t\max}$ as their respective maximum absolute values.
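The criterion in (9) and the stopping rule can be sketched in a few lines. The code assumes that `widths` lists $K_1,\ldots,K_{T+1}$ with $K_1$ taken to be the input dimension, and that `loglik_pair` is the summed log-likelihood of the iSHM pair at depth $T$. Both conventions, and the helper names, are assumptions for illustration.

```python
def aic(widths, loglik_pair):
    """widths: [K_1, ..., K_{T+1}]; loglik_pair: sum_i [ln P(y_i|.) + ln P(y*_i|.)]."""
    T = len(widths) - 1
    layer_cost = sum(2 * (widths[t] + 1) * widths[t + 1] for t in range(T))
    output_cost = 2 * widths[T]               # 2 * K_{T+1}
    return layer_cost + output_cost - 2.0 * loglik_pair

def select_depth(aic_values):
    """aic_values[t - 1] holds AIC(t). Returns the first T with AIC(T+1) > AIC(T),
    or the largest depth tried if the criterion never rises."""
    for T in range(1, len(aic_values)):
        if aic_values[T] > aic_values[T - 1]:
            return T
    return len(aic_values)

example_aic = aic([2, 3], loglik_pair=-10.0)   # 2*(2+1)*3 + 2*3 + 20
chosen = select_depth([10.0, 8.0, 9.0, 7.0])   # rises between depth 2 and depth 3
print(example_aic, chosen)
```

Because the rule stops at the first increase, a later decrease (depth 4 in the toy sequence) is never examined, which is what makes the selection a greedy forward procedure.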

## 4 Illustration and experimental results

To illustrate the imposed geometric constraint and inductive bias of a single iSHM, we first consider a challenging 2-D “two spirals” dataset, as shown in Fig. 1, whose two classes are not fully separable by a convex polytope. We train 10 pairs of iSHMs one pair after another, which are organized into a ten-hidden-layer PBDN, whose numbers of hidden units from the 1st to 10th hidden layers (i.e., numbers of support hyperplanes of the 1st to 10th iSHM pairs) are inferred to be 8, 14, 15, 11, 19, 22, 23, 18, 19, and 29, respectively. Both AIC and AIC$_\epsilon$ infer the depth as . We provide detailed explanations for Fig. 1 in Appendix D. We also apply PBDN to four different MNIST binary classification tasks and compare its performance with DNN (128-64), a two-hidden-layer deep neural network that will be described in detail below. As shown in Tab. 3, both AIC and AIC$_\epsilon$ infer the depth as $T=1$ for PBDN, and infer for each class a few active hyperplanes, each of which represents a distinct data subtype, as calculated with (2). The inferred networks of PBDN for all four tasks have only a single hidden layer with no more than 7 active hidden units. Thus its computation is much lower than that of DNN (128-64), while providing an overall lower testing error rate. Below we provide a more comprehensive comparison on another eight widely used benchmark datasets.

### 4.1 Comparison of various algorithms with benchmark datasets

We compare the proposed PBDN with logistic regression, Gaussian radial basis function (RBF) kernel support vector machine (SVM), relevance vector machine (RVM) (Tipping, 2001), adaptive multi-hyperplane machine (AMM) (Wang et al., 2011), convex polytope machine (CPM) (Kantchelian et al., 2014), and the deep neural network (DNN) classifier (DNNClassifier) provided in TensorFlow (Abadi et al., 2015). Logistic regression is a linear classifier, both kernel SVM and RVM are widely used nonlinear classifiers relying on the kernel trick, both AMM and CPM intersect multiple hyperplanes to construct their decision boundaries, and DNN uses a multilayer feedforward network, whose network structure often needs to be tuned to achieve a good balance between data fitting and model complexity, to handle nonlinearity. We consider DNN (8-4), a two-hidden-layer DNN that uses 8 and 4 hidden units for its first and second hidden layers, respectively, DNN (32-16), and DNN (128-64). In the Appendix, we summarize in Tab. 4 the information of eight benchmark datasets, including banana, breast cancer, titanic, waveform, german, image, ijcnn1, and a9a. For a fair comparison, to ensure the same training/testing partitions for all algorithms across all datasets, we report the results by using either widely used open-source software packages or the code made publicly available by the original authors. We describe in the Appendix the settings of all competing algorithms.

For all datasets, we follow Algorithm 1 in the Appendix to first train a single-hidden-layer PBDN (PBDN1), i.e., a pair of iSHMs fitted under two opposite labeling settings. We then follow Algorithm 2 to train another pair of iSHMs to construct a two-hidden-layer PBDN (PBDN2), and repeat the same procedure to train PBDN3 and PBDN4. Note we observe that PBDN’s log-likelihood increases rapidly during the first few hundred MCMC/SGD iterations, and then keeps increasing at a slower pace and eventually fluctuates. However, it often takes more iterations to shrink the weights of unneeded hyperplanes towards deactivation. Thus although an insufficient number of training iterations may not necessarily degrade the final out-of-sample prediction accuracy, it may lead to a less compact network and hence higher computational cost for out-of-sample prediction. For each iSHM, we set the upper bound on the number of support hyperplanes as . For Gibbs sampling, we run 5000 iterations and record with the highest likelihood during the last 2500 iterations; for MAP, we process 4000 mini-batches of size , with as the Adam learning rate for the $t$th added iSHM pair. We use the inferred to either produce out-of-sample predictions or generate transformed covariates for the next layer. We set , , and for Gibbs sampling. We fix and for MAP inference. As in Algorithm 1, we prune inactive support hyperplanes once every 200 MCMC or 500 SGD iterations to facilitate computation.

We record the out-of-sample-prediction errors and computational complexity of various algorithms over these eight benchmark datasets in Tab. 3 and Tab. 5 of the Appendix, respectively, and summarize in Tab. 3 the means of SVM normalized errors and numbers of support hyperplanes/vectors. Overall, PBDN using AIC in (10) with to determine the depth, referred to as PBDN-AIC, has the highest out-of-sample prediction accuracy, followed by PBDN4, the RBF kernel SVM, PBDN using AIC in (9) to determine the depth, referred to as PBDN-AIC, PBDN2, DNN (128-64), PBDN-AIC solved with SGD, DNN (32-16), and PBDN-AIC solved with SGD.

Overall, logistic regression does not perform well, which is not surprising as it is a linear classifier that uses a single hyperplane to partition the covariate space into two halves to separate one class from the other. As shown in Tab. 3 of the Appendix, for breast cancer, titanic, german, and a9a, all classifiers have comparable classification errors, suggesting minor or no advantages of using a nonlinear classifier on them. By contrast, for banana, waveform, image, and ijcnn1, all nonlinear classifiers clearly outperform logistic regression. Note PBDN1, which clearly reduces the classification errors of logistic regression, performs similarly to both AMM and CPM. These results are not surprising as CPM, closely related to AMM, uses a convex polytope, defined as the intersection of multiple hyperplanes, to enclose one class, whereas the classification decision boundaries of PBDN1 can be bounded within a convex polytope that encloses negative examples. Note that the number of hyperplanes is automatically inferred from the data by PBDN1, thanks to the inherent shrinkage mechanism of the gamma process, whereas those of AMM and CPM are both selected via cross-validation. While PBDN1 can partially remedy their sensitivity to how the data are labeled by combining the results obtained under two opposite labeling settings, the decision boundaries of the two iSHMs and those of both AMM and CPM are still restricted to a confined space related to a single convex polytope, which may be used to explain why on banana, image, and ijcnn1, they all clearly underperform a PBDN with more than one hidden layer.

As shown in Tab. 3, DNN (8-4) clearly underperforms DNN (32-16) in terms of classification accuracy on both image and ijcnn1, indicating that having 8 and 4 hidden units for the first and second hidden layers, respectively, is far from enough for DNN to provide a sufficiently high nonlinear modeling capacity for these two datasets. Note that the equivalent number of hyperplanes for DNN (), a two-hidden-layer DNN with and hidden units in the first and second hidden layers, respectively, is computed as Thus the computational complexity quickly increases as the network size increases. For example, DNN (8-4) is comparable to PBDN1 and PBDN-AIC in terms of out-of-sample-prediction computational complexity, as shown in Tabs. 3 and 5, but it clearly underperforms all of them in terms of classification accuracy, as shown in Tab. 3. While DNN (128-64) performs well in terms of classification accuracy, as shown in Tab. 3, its out-of-sample-prediction computational complexity becomes clearly higher than the other algorithms with comparable or better accuracy, such as RVM and the PBDN, as shown in Tab. 5. In practice, however, the search space for a DNN with two or more hidden layers is enormous, making it difficult to determine a network that is neither too large nor too small to achieve a good compromise between fitting the data well and having low complexity for both training and out-of-sample prediction. E.g., while DNN (128-64) could further improve the performance of DNN (32-16) on these two datasets, it uses a much larger network and clearly higher computational complexity.

We show the inferred number of active support hyperplanes by PBDN in a single random trial in Fig. 3. For PBDN, the computation in both training and out-of-sample prediction also increases in $T$, the network depth. It is clear from Tab. 3 that increasing $T$ from 1 to 2 generally leads to the most significant improvement if there is a clear advantage of increasing $T$, and once $T$ is sufficiently large, further increasing $T$ leads to small performance fluctuations but does not appear to lead to clear overfitting. As shown in Tab. 3, the use of the AIC based greedy model selection criterion eliminates the need to tune the depth $T$, allowing it to be inferred from the data. Note we have tried stacking CPMs in the same way that we stack iSHMs, but found that the accuracy often quickly deteriorates rather than improves. E.g., for CPMs with (2, 3, or 4) layers, the error rates become (0.131, 0.177, 0.223) on waveform, and (0.046, 0.080, 0.216) on image. The reason could be that CPM infers redundant unweighted hyperplanes that lead to strong multicollinearity for the covariates of deep layers.

Note on each given dataset, we have tried training a DNN with the same network architecture inferred by a PBDN. While a DNN jointly trains all its hidden layers, it provides no performance gain over the corresponding PBDN. More specifically, the DNNs using the network architectures inferred by PBDNs with AIC-Gibbs, AIC-Gibbs, AIC-SGD, and AIC-SGD, have the mean of SVM normalized errors as 1.047, 1.011, 1.076, and 1.144, respectively. These observations suggest the efficacy of the greedy-layer-wise training strategy of the PBDN, which requires no cross-validation.

For out-of-sample prediction, the computation of a classification algorithm generally increases linearly in the number of support hyperplanes/vectors. Using logistic regression with a single hyperplane for reference, we summarize the computational complexity in Tab. 3, which indicates that in comparison to SVM, which consistently requires the largest number of support vectors, PBDN often requires significantly less time for predicting the class label of a new data sample. For example, for out-of-sample prediction for the image dataset, as shown in Tab. 5, on average SVM uses about 212 support vectors, whereas on average the PBDN with one to five hidden layers uses about 13, 16, 29, 50, and 64 hyperplanes, respectively, and PBDN-AIC uses about 22 hyperplanes, showing that in comparison to kernel SVM, PBDN could be much more computationally efficient in making out-of-sample prediction.

## 5 Conclusions

The infinite support hyperplane machine (iSHM), which interacts countably infinite non-negative weighted hyperplanes via a noisy-OR mechanism, is employed as the building unit to greedily construct a capacity-regularized parsimonious Bayesian deep network (PBDN). iSHM has an inductive bias in fitting the positively labeled data, and employs the gamma process to infer a parsimonious set of active hyperplanes to enclose negatively labeled data within a convex-polytope bounded space. Due to the inductive bias and label asymmetry, iSHMs are trained in pairs to ensure a sufficient coverage of the covariate space occupied by the data from both classes. The sequentially trained iSHM pairs can be stacked into a PBDN, a feedforward deep network that gradually enhances its modeling capacity as the network depth increases. While achieving classification accuracy comparable to kernel support vector machines and deep neural networks, using either Gibbs sampling that is suitable for quantifying posterior uncertainty, or mini-batch based stochastic gradient descent MAP inference that is scalable to big data, the PBDN infers a compact network by combining Bayesian nonparametrics and forward model selection, leading to low computational complexity for out-of-sample prediction.

## References

• Abadi et al. [2015] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
• Aiolli and Sperduti [2005] F. Aiolli and A. Sperduti. Multiclass classification with multi-prototype support vector machines. J. Mach. Learn. Res., 6:817–850, 2005.
• Arora et al. [2016] S. Arora, R. Ge, T. Ma, and A. Risteski. Provable learning of noisy-or networks. arXiv preprint arXiv:1612.08795, 2016.
• Bishop [1995] C. M. Bishop. Neural networks for pattern recognition. Oxford university press, 1995.
• Caron and Fox [2015] F. Caron and E. B. Fox. Sparse graphs using exchangeable random measures. arXiv:1401.1137v3, 2015.
• Chang and Lin [2011] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
• Chang et al. [2010] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. J. Mach. Learn. Res., 11:1471–1490, 2010.
• Crammer and Singer [2002] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292, 2002.
• Diethe [2015] T. Diethe. 13 benchmark datasets derived from the UCI, DELVE and STATLOG repositories.
• Djuric et al. [2013] N. Djuric, L. Lan, S. Vucetic, and Z. Wang. BudgetedSVM: A toolbox for scalable SVM approximations. J. Mach. Learn. Res., 14:3813–3817, 2013.
• Dunson and Herring [2005] D. B. Dunson and A. H. Herring. Bayesian latent variable models for mixed discrete outcomes. Biostatistics, 6(1):11–25, 2005.
• Fan et al. [2008] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, 2008.
• Ferguson [1973] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Statist., 1(2):209–230, 1973.
• Glorot et al. [2011] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, pages 315–323, 2011.
• Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
• Hinton et al. [2006] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
• Jordan et al. [1999] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
• Kantchelian et al. [2014] A. Kantchelian, M. C. Tschantz, L. Huang, P. L. Bartlett, A. D. Joseph, and J. D. Tygar. Large-margin convex polytope machine. In NIPS, pages 3248–3256, 2014.
• Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
• Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
• LeCun et al. [2015] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
• Manwani and Sastry [2010] N. Manwani and P. S. Sastry. Learning polyhedral classifiers using logistic function. In ACML, pages 17–30, 2010.
• Manwani and Sastry [2011] N. Manwani and P. S. Sastry. Polyceptron: A polyhedral learning algorithm. arXiv:1107.1564, 2011.
• Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
• Pearl [1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
• Polson and Scott [2011] N. G. Polson and J. G. Scott. Default Bayesian analysis for multi-way tables: a data-augmentation approach. arXiv:1109.4180v1, 2011.
• Polson et al. [2013] N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya–Gamma latent variables. J. Amer. Statist. Assoc., 108(504):1339–1349, 2013.
• Rätsch et al. [2001] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine learning, 42(3):287–320, 2001.
• Schölkopf et al. [1999] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in kernel methods: support vector learning. MIT Press, 1999.
• Shang et al. [2016] W. Shang, K. Sohn, D. Almeida, and H. Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In ICML, 2016.
• Steinwart [2003] I. Steinwart. Sparseness of support vector machines. J. Mach. Learn. Res., 4:1071–1105, 2003.
• Tipping [2001] M. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, June 2001.
• Vapnik [1998] V. Vapnik. Statistical learning theory. Wiley New York, 1998.
• Wang et al. [2011] Z. Wang, N. Djuric, K. Crammer, and S. Vucetic. Trading representability for scalability: adaptive multi-hyperplane machine for nonlinear classification. In KDD, pages 24–32, 2011.
• Zhou [2015] M. Zhou. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, pages 1135–1143, 2015.
• Zhou [2018] M. Zhou. Discussion on “sparse graphs using exchangeable random measures” by Francois Caron and Emily B. Fox. arXiv preprint arXiv:1802.07721, 2018.
• Zhou and Carin [2015] M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell., 37(2):307–320, 2015.
• Zhou et al. [2012a] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, pages 1462–1471, 2012a.
• Zhou et al. [2012b] M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. In ICML, pages 1343–1350, 2012b.

Parsimonious Bayesian deep networks: supplementary material

## Appendix A Proofs

###### Proof of Theorem 1.

Since a.s., if (6) is true, then a.s. for all . Thus if (6) is true, then (7) is true a.s., which means the set of solutions to (6) is included in the set of solutions to (7). ∎

## Appendix B Related multi-hyperplane models

Generalizing the construction of multiclass support vector machines in [Crammer and Singer, 2002], the idea of combining multiple hyperplanes to define complex classification decision boundaries has been discussed before [Aiolli and Sperduti, 2005, Wang et al., 2011, Manwani and Sastry, 2010, 2011, Kantchelian et al., 2014]. In particular, the convex polytope machine (CPM) of [Kantchelian et al., 2014] exploits the idea of learning a convex polytope to separate one class from the other. From this point of view, the proposed iSHM is related to the CPM, as its decision boundary can be explicitly bounded by a convex polytope that encloses the data labeled as zeros, as described in Theorem 1 and illustrated in Fig. 1. Unlike the CPM, which uses a convex polytope as its decision boundary and provides neither probability estimates for class labels nor a principled way to set its number of equally weighted hyperplanes, iSHM makes its decision boundary smoother than the corresponding bounding convex polytope, as shown in Figs. 1 (c) and (f), by using more complex interactions between hyperplanes than simple intersection. iSHM also provides probability estimates for its labels, and supports countably infinite differently weighted hyperplanes with the gamma process. In addition, to solve its non-convex objective function, the CPM relies on heuristics to force the learning of each hyperplane to be a convex optimization problem, whereas iSHM uses Bayesian inference, in which each data point assigns a binary indicator to each hyperplane. Moreover, the iSHM pair is used as the building unit to construct a PBDN and thereby quickly boost the modeling power, as will be shown below.

## Appendix C Gibbs sampling update equations

To begin with, we sample the latent count $m_i$ and then partition it into $(m_{i1},\ldots,m_{iK})$ for different hyperplanes, where the value of is related to how likely in the posterior, and the ratio is related to how much expert contributes to the overall cause of . Below we first describe the Gibbs sampling update equations for $m_i$ and $m_{ik}$.

Sample . Denote  . Since a.s. given and given , and in the prior , following the inference in [Zhou, 2015], we can sample as

 (11)

where denotes a draw from the zero-truncated Poisson distribution.
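As an illustration of the zero-truncated Poisson draw mentioned above, the following is a minimal NumPy sketch, not the author's implementation. The function name and the generic rate argument `lam` are assumptions introduced here; how the rate is formed from the model parameters is determined by Eq. (11) and is not reproduced. The sketch uses the fact that a Poisson count conditioned on being at least one can be generated from the first arrival time of a unit-rate Poisson process truncated to the interval $(0,\lambda)$, plus the arrivals in the remaining time.

```python
import numpy as np

def sample_zero_truncated_poisson(lam, rng):
    """Draw from Poisson(lam) conditioned on the outcome being >= 1.

    `lam` is a generic positive rate (scalar or array); its relation to the
    model parameters is an assumption left to the caller.
    """
    lam = np.asarray(lam, dtype=float)
    u = rng.uniform(size=lam.shape)
    # First arrival time of a unit-rate Poisson process, truncated to (0, lam):
    # inverse CDF of an Exp(1) variable conditioned on being smaller than lam.
    t = -np.log1p(-u * (-np.expm1(-lam)))
    # One guaranteed arrival plus the arrivals in the remaining time lam - t.
    remaining = np.maximum(lam - t, 0.0)  # guard against rounding below zero
    return 1 + rng.poisson(remaining)

rng = np.random.default_rng(0)
draws = sample_zero_truncated_poisson(np.full(200_000, 0.7), rng)
```

Every draw is at least one, and the sample mean should be close to $\lambda/(1-e^{-\lambda})$, the mean of the zero-truncated Poisson distribution.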

Sample . Once the latent counts are known, it becomes clear how much expert contributes to the cause of . Since letting is equivalent in distribution to letting , similar to Dunson and Herring [2005] and Zhou et al. [2012a], we sample $(m_{i1},\ldots,m_{iK})$ as

$$(m_{i1},\ldots,m_{iK}\mid -)\sim \mathrm{Mult}\left(m_i,\ \theta_{i1}/\theta_{i\cdot},\ldots,\theta_{iK}/\theta_{i\cdot}\right). \tag{12}$$
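The multinomial partition in Eq. (12) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `partition_counts` and the array layout (one row per data point, one column per expert) are assumptions made here, and $\theta_{i\cdot}$ is taken to be the sum of $\theta_{ik}$ over $k$.

```python
import numpy as np

def partition_counts(m, theta, rng):
    """Split each total count m[i] across K experts as in Eq. (12).

    m:     (N,) non-negative integer totals m_i
    theta: (N, K) positive weights theta_ik (layout is an assumption)
    Returns an (N, K) integer array whose rows sum to m.
    """
    m = np.asarray(m)
    theta = np.asarray(theta, dtype=float)
    # Row-normalize so that probs[i, k] = theta_ik / theta_i.
    probs = theta / theta.sum(axis=1, keepdims=True)
    # One multinomial draw per data point.
    return np.stack([rng.multinomial(m_i, p_i) for m_i, p_i in zip(m, probs)])

rng = np.random.default_rng(0)
m = np.array([3, 0, 5])
theta = np.array([[1.0, 2.0], [0.5, 0.5], [4.0, 1.0]])
m_split = partition_counts(m, theta, rng)
```

A data point with a total count of zero simply receives a row of zeros, so no special handling is needed for it.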

The key remaining problem is to infer . Note that marginalizing out from (5) leads to

$$m_{ik}\sim \mathrm{NB}\left[r_k,\ 1/(1+e^{-x_i'\beta_k})\right], \tag{13}$$

where represents a negative binomial (NB) distribution with shape and probability . We thus exploit the augmentation techniques developed for the NB distribution in [Zhou and Carin, 2015] to sample , and those developed for logistic regression in [Polson and Scott, 2011] and further generalized to NB regression in [Zhou et al., 2012b] and [Polson et al., 2013] to sample . We outline Gibbs sampling in Algorithm 1, where to save computation, we set as the upper bound of the number of experts and prune the experts assigned zero counts during MCMC iterations. Note that except for the sampling of , the sampling of all the other parameters of different experts is embarrassingly parallel.
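The pruning step mentioned above, in which experts assigned zero counts are dropped during MCMC iterations, can be sketched as follows. The function name and the array layout (counts stored as an $N \times K$ matrix, per-expert parameters indexed along their first axis) are assumptions made for illustration, not details taken from Algorithm 1.

```python
import numpy as np

def prune_inactive_experts(m, r, beta):
    """Keep only experts that received at least one count in this iteration.

    m:    (N, K) latent counts m_ik
    r:    (K,)   expert weights r_k
    beta: (K, V) regression coefficients, one row per expert
    """
    active = m.sum(axis=0) > 0  # boolean mask over the K experts
    return m[:, active], r[active], beta[active]

m = np.array([[2, 0, 1],
              [0, 0, 3]])
r = np.array([0.4, 0.1, 0.9])
beta = np.arange(6.0).reshape(3, 2)
m_a, r_a, beta_a = prune_inactive_experts(m, r, beta)
```

In this toy input the second expert receives no counts, so two experts remain after pruning.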

Gibbs sampling via data augmentation and marginalization proceeds as follows.

Sample . Using data augmentation for NB regression, as in [Zhou et al., 2012b] and [Polson et al., 2013], we denote as a random variable drawn from the Polya-Gamma (PG) distribution [Polson and Scott, 2011] as under which we have . Since , the likelihood of can be expressed as

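$$L(\psi_{ik})\propto\frac{(e^{\psi_{ik}})^{m_{ik}}}{(1+e^{\psi_{ik}})^{m_{ik}+r_k}}=\frac{2^{-(m_{ik}+r_k)}\exp\left(\frac{m_{ik}-r_k}{2}\psi_{ik}\right)}{\cosh^{m_{ik}+r_k}(\psi_{ik}/2)}\propto\exp\left(\frac{m_{ik}-r_k}{2}\psi_{ik}\right)\mathbb{E}_{\omega_{ik}}\left[\exp\left(-\omega_{ik}\psi_{ik}^2/2\right)\right].$$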
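The middle equality in the likelihood above is a purely algebraic identity, which follows from $1+e^{\psi}=2e^{\psi/2}\cosh(\psi/2)$. The short NumPy check below is only a sanity check of that identity. The function names are hypothetical, and `b` is a generic stand-in for the positive quantity added to the count `m` in the exponent.

```python
import numpy as np

def logistic_form(psi, m, b):
    """(e^psi)^m / (1 + e^psi)^(m + b)."""
    return np.exp(psi * m) / (1.0 + np.exp(psi)) ** (m + b)

def cosh_form(psi, m, b):
    """2^-(m+b) * exp((m - b) * psi / 2) / cosh(psi / 2)^(m + b)."""
    return (2.0 ** (-(m + b)) * np.exp(0.5 * (m - b) * psi)
            / np.cosh(0.5 * psi) ** (m + b))

psi = np.linspace(-4.0, 4.0, 9)
lhs = logistic_form(psi, m=3, b=0.7)
rhs = cosh_form(psi, m=3, b=0.7)
```

The two arrays agree to floating-point precision. The final proportionality in the displayed likelihood additionally relies on the Pólya–Gamma expectation identity of [Polson et al., 2013], which this check does not cover.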

Combining the likelihood and the prior, we sample the auxiliary Pólya–Gamma random variables $\omega_{ik}$ as

$$(\omega_{ik}\mid -)\sim \mathrm{PG}\left(m_{ik}+r_k,\ x_i'\beta_k\right), \tag{14}$$

conditioning on which we sample $\beta_k$ as

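$$(\beta_k\mid -)\sim \mathcal{N}(\mu_k,\Sigma_k),\quad \Sigma_k=\Big(\mathrm{diag}(\alpha_{1k},\ldots,\alpha_{Vk})+\sum_i\omega_{ik}x_ix_i'\Big)^{-1},\quad \mu_k=\Sigma_k\Big[\sum_i\Big(\frac{m_{ik}-r_k}{2}\Big)x_i\Big]. \tag{15}$$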
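Given the auxiliary variables $\omega_{ik}$, the Gaussian update in Eq. (15) can be sketched with NumPy as below. This is a minimal sketch under stated assumptions rather than the author's code: the function name, the argument layout, and the use of a Cholesky factor of the precision matrix instead of an explicit inverse are choices made here. The Pólya–Gamma draws themselves are taken as an input, since NumPy has no sampler for them, and the number of columns of `X` is assumed to match the length of `alpha_k`.

```python
import numpy as np

def sample_beta_k(X, m_k, omega_k, r_k, alpha_k, rng):
    """Draw beta_k ~ N(mu_k, Sigma_k) following Eq. (15).

    X:       (N, V) covariates, one row x_i per data point
    m_k:     (N,)   latent counts m_ik for expert k
    omega_k: (N,)   auxiliary variables omega_ik (assumed already sampled)
    r_k:     scalar weight of expert k
    alpha_k: (V,)   prior precisions alpha_vk
    """
    # Posterior precision: diag(alpha_k) + sum_i omega_ik x_i x_i'.
    precision = np.diag(alpha_k) + X.T @ (omega_k[:, None] * X)
    rhs = X.T @ ((m_k - r_k) / 2.0)
    mu = np.linalg.solve(precision, rhs)   # mu_k = Sigma_k @ rhs
    chol = np.linalg.cholesky(precision)   # precision = chol @ chol.T
    z = rng.standard_normal(mu.shape[0])
    # solve(chol.T, z) has covariance precision^{-1} = Sigma_k.
    beta = mu + np.linalg.solve(chol.T, z)
    return beta, mu

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
m_k = rng.poisson(1.0, size=50)
omega_k = rng.uniform(0.1, 1.0, size=50)   # placeholder values, not PG draws
alpha_k = np.ones(3)
beta_k, mu_k = sample_beta_k(X, m_k, omega_k, 0.5, alpha_k, rng)
```

Working with the precision matrix avoids forming $\Sigma_k$ explicitly, which is usually cheaper and numerically more stable when the same matrix is needed for both the mean and the draw.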

Sample $\theta_{ik}$. Using the gamma-Poisson conjugacy, we sample $\theta_{ik}$ as

 (θik|−)∼Gamma(rk+mik,