Adaptive Network Sparsification via
Dependent Variational Beta-Bernoulli Dropout
Abstract
While variational dropout approaches have been shown to be effective for network sparsification, they are still suboptimal in the sense that they set the dropout rate for each neuron without consideration of the input data. With such input-independent dropout, each neuron is evolved to be generic across inputs, which makes it difficult to sparsify networks without accuracy loss. To overcome this limitation, we propose adaptive variational dropout whose probabilities are drawn from sparsity-inducing beta-Bernoulli priors. It allows each neuron to evolve either to be generic or specific to certain inputs, or to be dropped altogether. Such input-adaptive sparsity-inducing dropout allows the resulting network to tolerate a larger degree of sparsity without losing its expressive power, by removing redundancies among features. We validate our dependent variational beta-Bernoulli dropout on multiple public datasets, on which it obtains significantly more compact networks than baseline methods, with consistent accuracy improvements over the base networks.
Juho Lee, Saehoon Kim, Jaehong Yoon, Hae Beom Lee, Eunho Yang, Sung Ju Hwang UNIST, Ulsan, South Korea; KAIST, Daejeon, South Korea; AITRICS, Seoul, South Korea; University of Oxford, Oxford, United Kingdom juho.lee@stats.ox.ac.uk, shkim@aitrics.com, jaehong.yoon@kaist.ac.kr, hblee@unist.ac.kr, eunhoy@kaist.ac.kr, sjhwang82@kaist.ac.kr
Preprint. Work in progress.
1 Introduction
One of the main obstacles in applying deep learning to large-scale problems and low-power computing systems is the large number of network parameters, as it can lead to excessive memory and computational overhead. To tackle this problem, researchers have explored network sparsification methods that remove unnecessary connections in a network, implementable either by weight pruning [6] or by sparsity-inducing regularizations [27].
Recently, variational Bayesian approaches have been shown to be useful for network sparsification, outperforming non-Bayesian counterparts. They take a completely different approach from conventional methods, which use either thresholding or sparsity-inducing norms on the parameters, and instead use the well-known dropout regularization. Specifically, these approaches use variational dropout [10], which adds multiplicative stochastic noise to each neuron, as a means of obtaining sparse neural networks. Removal of unnecessary neurons can be done either by setting the dropout rate individually for each neuron with an unbounded dropout rate [18] or by pruning based on the signal-to-noise ratio [20].
While these variational dropout approaches do yield compact networks, they are suboptimal in that the dropout rate for each neuron is learned completely independently of the given input data and labels. With input-independent dropout regularization, each neuron has no choice but to encode generic information for all possible inputs, since it does not know which inputs and tasks it will be given at evaluation time, as each neuron is retained with a fixed rate regardless of the input. Obtaining a high degree of sparsity in such a setting is difficult, as dropping any of the neurons results in information loss. For maximal utilization of the network capacity, and thus a more compact model, each neuron should instead be either irreplaceably generic and used by all tasks, or highly specialized for a task, such that there is minimal redundancy among the learned representations. This goal can be achieved by adaptively setting the dropout probability for each input, such that some of the neurons are retained with high probability only for certain types of inputs and tasks.
To this end, we propose a novel input-dependent variational dropout regularization for data- and task-dependent network sparsification. We first propose beta-Bernoulli dropout, which learns to set the dropout rate for each individual neuron by generating the dropout mask from a beta-Bernoulli prior, and show how to train it using variational inference. This dropout regularization is a proper way of obtaining a Bayesian neural network and also sparsifies the network, since the beta-Bernoulli distribution is a sparsity-inducing prior. Then, we propose dependent beta-Bernoulli dropout, an input-dependent version of our variational dropout regularization.
Such adaptive regularization has been used for general network regularization by a non-Bayesian and non-sparsity-inducing model [1]; yet the increased memory and computational overheads that come from learning additional weights for dropout mask generation made it less appealing for generic network regularization. In our case of network sparsification, however, the overhead at training time is more than rewarded by the reduced memory and computational requirements at evaluation time, thanks to the high degree of sparsification obtained in the final output model.
We validate our dependent beta-Bernoulli variational dropout regularizer on multiple public datasets for network sparsification performance and prediction error, on which it obtains significantly more compact networks (with substantial memory reduction and speedup on CIFAR-100) and substantially reduced prediction errors, compared with both the base networks and existing network sparsification methods. Further analysis of the learned dropout probability of each unit reveals that our input-adaptive variational dropout approach generates a clearly distinguishable dropout mask for each task, enabling each task to utilize different sets of neurons for specialization.
Our contribution in this paper is threefold:

We propose beta-Bernoulli dropout, a novel dropout regularizer that learns to generate a Bernoulli dropout mask for each neuron under a sparsity-inducing prior, obtaining a high degree of sparsity without accuracy loss.

We further propose dependent beta-Bernoulli dropout, which yields significantly more compact networks than input-independent beta-Bernoulli dropout, and which further performs runtime pruning for even lower computational cost.

Our beta-Bernoulli dropout regularizations provide a novel way to implement a sparse Bayesian neural network, and we provide a variational inference framework for learning it.
2 Related Work
Deep neural networks are known to be prone to overfitting, due to their large number of parameters. Dropout [25] is an effective regularization that helps prevent overfitting by reducing the co-adaptation of units in the network. During dropout training, the hidden units in the network are randomly dropped with a fixed probability $p$, which is equivalent to multiplying Bernoulli noise into the units. It was later found that multiplying Gaussian noise with the same mean and variance, $\epsilon \sim \mathcal{N}(1, p/(1-p))$, works just as well or even better [25].
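As a quick numerical illustration (not from the paper), the equivalence in mean and variance between rescaled Bernoulli dropout noise and its Gaussian counterpart can be checked directly; the dropout probability and sample size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5        # dropout probability (illustrative)
n = 100_000

# Bernoulli dropout mask, rescaled by 1/(1-p) as in inverted dropout,
# so that the multiplicative noise has mean 1
bern = (rng.random(n) < (1 - p)) / (1 - p)

# Gaussian multiplicative noise with matched mean 1 and variance p/(1-p)
gaus = rng.normal(1.0, np.sqrt(p / (1 - p)), size=n)

# Both noises have empirical mean ~1 and variance ~p/(1-p)
print(bern.mean(), bern.var(), gaus.mean(), gaus.var())
```

With matched first and second moments, either noise can serve as the multiplicative dropout perturbation during training.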
Dropout regularization generally treats the dropout rate as a hyperparameter to be tuned, but several studies aim to determine the proper dropout rate automatically. Kingma et al. [10] propose to determine the variance of the Gaussian dropout by stochastic gradient variational Bayes. Generalized dropout [24] places a beta prior on the dropout rate and learns the posterior of the dropout rate through variational Bayes, showing that by adjusting the hyperparameters of the beta prior one can obtain several regularization algorithms with different characteristics. Our beta-Bernoulli dropout is similar to one of its special cases, but while they obtain the dropout estimates via point estimates and compute the gradients of the binary random variables with biased heuristics, we approximate the posterior distribution of the dropout rate with variational distributions and compute asymptotically unbiased gradients for the binary random variables.
Ba et al. [1] proposed adaptive dropout (StandOut), where the dropout rate for each individual neuron is determined as a function of the inputs. This idea is similar in spirit to our dependent beta-Bernoulli dropout, but they use heuristics to model this function, while we use a proper variational Bayesian approach to obtain the dropout rates. One drawback of their model is the increased memory and computational cost from the additional parameters introduced for dropout mask generation, which is not negligible when the network is large. Our model also requires additional parameters, but the increased cost at training time is rewarded at evaluation time, as our model yields a significantly sparser network than the baseline as an effect of the sparsity-inducing prior.
Recently, there has been growing interest in structure learning and sparsification of deep neural networks. Han et al. [6] proposed a strategy to iteratively prune weak network weights for efficient computation, and Wen et al. [27] proposed a group sparsity learning algorithm to drop neurons, filters, or even residual blocks in deep neural networks. In Bayesian learning, various sparsity-inducing priors have been demonstrated to efficiently prune network weights with little drop in accuracy [18, 15, 20, 16]. From the nonparametric Bayesian perspective, Feng et al. [2] proposed an IBP-based algorithm that learns the proper number of channels in convolutional neural networks using the asymptotic small-variance limit approximation of the IBP. While our dropout regularizer is also motivated by the IBP, our work is differentiated by the input-adaptive adjustment of dropout rates, which allows each neuron to specialize into features specific to some subsets of tasks.
3 Background
3.1 Bayesian Neural Networks and Stochastic Gradient Variational Bayes
Suppose that we are given a neural network parametrized by $\mathcal{W}$, a training set $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$, and a likelihood $p(y \mid \mathbf{x}, \mathcal{W})$ chosen according to the problem of interest (e.g., the categorical distribution for a classification task). In Bayesian neural networks, the parameter $\mathcal{W}$ is treated as a random variable drawn from a pre-specified prior distribution $p(\mathcal{W})$, and the goal of learning is to compute the posterior distribution $p(\mathcal{W} \mid \mathcal{D})$:
$p(\mathcal{W} \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \mathcal{W})\, p(\mathcal{W})}{p(\mathcal{D})}$.  (1)
When a novel input $\mathbf{x}^*$ is given, the prediction is obtained as a distribution by mixing $p(y^* \mid \mathbf{x}^*, \mathcal{W})$ over $p(\mathcal{W} \mid \mathcal{D})$ as follows:
$p(y^* \mid \mathbf{x}^*, \mathcal{D}) = \int p(y^* \mid \mathbf{x}^*, \mathcal{W})\, p(\mathcal{W} \mid \mathcal{D})\, \mathrm{d}\mathcal{W}$.  (2)
Unfortunately, $p(\mathcal{W} \mid \mathcal{D})$ is in general computationally intractable due to the normalizing constant $p(\mathcal{D})$, and thus we resort to approximate inference schemes. Specifically, we use variational Bayes (VB), where we posit a variational distribution $q(\mathcal{W}; \boldsymbol{\phi})$ of known parametric form and minimize the KL-divergence $\mathrm{KL}[q(\mathcal{W}; \boldsymbol{\phi}) \,\|\, p(\mathcal{W} \mid \mathcal{D})]$ between it and the true posterior. It turns out that minimizing this KL-divergence is equivalent to maximizing the evidence lower bound (ELBO),
$\mathcal{L}(\boldsymbol{\phi}) = \sum_{n=1}^N \mathbb{E}_{q(\mathcal{W}; \boldsymbol{\phi})}\big[\log p(y_n \mid \mathbf{x}_n, \mathcal{W})\big] - \mathrm{KL}\big[q(\mathcal{W}; \boldsymbol{\phi}) \,\|\, p(\mathcal{W})\big]$,  (3)
where the first term measures the expected log-likelihood of the dataset w.r.t. $q(\mathcal{W}; \boldsymbol{\phi})$, and the second term regularizes $q(\mathcal{W}; \boldsymbol{\phi})$ so that it does not deviate too much from the prior. The parameter $\boldsymbol{\phi}$ is learned by gradient descent, but this involves two challenges. First, the expected log-likelihood is intractable in many cases, and so is its gradient. To resolve this, we assume that $q(\mathcal{W}; \boldsymbol{\phi})$ is reparametrizable, so that we can obtain i.i.d. samples from $q(\mathcal{W}; \boldsymbol{\phi})$ by computing a differentiable transformation of i.i.d. noise [11, 22], as $\mathcal{W} = \mathcal{T}(\boldsymbol{\varepsilon}; \boldsymbol{\phi})$ with $\boldsymbol{\varepsilon} \sim r(\boldsymbol{\varepsilon})$. Then we can obtain a low-variance unbiased estimator of the gradient, namely
$\nabla_{\boldsymbol{\phi}}\, \mathbb{E}_{q}\big[\log p(y_n \mid \mathbf{x}_n, \mathcal{W})\big] \approx \frac{1}{S} \sum_{s=1}^S \nabla_{\boldsymbol{\phi}} \log p\big(y_n \mid \mathbf{x}_n, \mathcal{T}(\boldsymbol{\varepsilon}^{(s)}; \boldsymbol{\phi})\big), \quad \boldsymbol{\varepsilon}^{(s)} \sim r(\boldsymbol{\varepsilon})$.  (4)
The second challenge is that the number of training instances $N$ may be too large, which makes it impossible to compute the summation of all $N$ expected log-likelihood terms. To address this, we employ stochastic gradient descent, where we approximate the full sum with a summation over a uniformly sampled mini-batch $\mathcal{B} \subset \{1, \dots, N\}$,
$\sum_{n=1}^N \mathbb{E}_q\big[\log p(y_n \mid \mathbf{x}_n, \mathcal{W})\big] \approx \frac{N}{|\mathcal{B}|} \sum_{n \in \mathcal{B}} \mathbb{E}_q\big[\log p(y_n \mid \mathbf{x}_n, \mathcal{W})\big]$.  (5)
Combining the reparametrization and the mini-batch sampling, we obtain an unbiased estimator of $\nabla_{\boldsymbol{\phi}} \mathcal{L}(\boldsymbol{\phi})$ to update $\boldsymbol{\phi}$. This procedure, often referred to as stochastic gradient variational Bayes (SGVB) [11], is guaranteed to converge to local optima under proper learning-rate scheduling.
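The SGVB recipe above (reparametrized sampling plus mini-batch rescaling) can be sketched on a toy Bayesian linear regression with a factorized Gaussian posterior. This is a minimal illustration, not the paper's model; for brevity the variational standard deviation is held fixed and only the mean is updated, and all hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X w_true + noise
N, D, noise_std = 1000, 5, 0.1
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + noise_std * rng.normal(size=N)

mu = np.zeros(D)            # variational mean of q(w)
sigma = np.full(D, 1e-2)    # fixed variational std (kept constant for simplicity)

lr, batch_size = 5e-6, 100
for step in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Reparametrization: w = mu + sigma * eps, eps ~ N(0, I)
    w = mu + sigma * rng.normal(size=D)
    resid = yb - Xb @ w
    # Mini-batch-rescaled gradient of the expected log-likelihood w.r.t. mu
    grad_loglik = (N / batch_size) * (Xb.T @ resid) / noise_std**2
    # Gradient of -KL[q(w) || N(0, I)] w.r.t. mu is -mu
    mu += lr * (grad_loglik - mu)

# mu approaches the posterior mean, which is close to w_true here
```

The two estimator pieces correspond directly to the text: the `N / batch_size` factor is the mini-batch correction, and sampling `w` through `mu + sigma * eps` is the reparametrization that makes the gradient flow to the variational parameters.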
3.2 Latent feature models and Indian Buffet Processes
In latent feature models, data are assumed to be generated as combinations of latent features:
$\mathbf{x}_n = f(z_{n,1}\mathbf{a}_1, \dots, z_{n,K}\mathbf{a}_K), \qquad z_{n,k} \in \{0, 1\}$,  (6)
where $z_{n,k} = 1$ means that $\mathbf{x}_n$ possesses the $k$-th feature $\mathbf{a}_k$, and $f$ is an arbitrary function.
The Indian Buffet Process (IBP) [5] is a generative process for binary matrices with an infinite number of columns. Given $N$ data points, the IBP generates a binary matrix $\mathbf{Z}$ whose $n$-th row encodes the feature indicators $(z_{n,1}, z_{n,2}, \dots)$. The IBP is suitable as a prior process for latent feature models, since it generates a potentially infinite number of columns and adaptively adjusts the number of features to the given dataset. Hence, with an IBP prior, we need not specify the number of features in advance.
One interesting observation is that while the IBP is a marginal of the beta-Bernoulli process [26], it may also be understood as a limit of the finite-dimensional beta-Bernoulli process. More specifically, the IBP with parameter $\alpha > 0$ can be obtained as
$\pi_k \sim \mathrm{Beta}(\alpha/K, 1), \qquad z_{n,k} \mid \pi_k \sim \mathrm{Ber}(\pi_k), \qquad K \to \infty$.  (7)
This beta-Bernoulli process naturally induces sparsity in the latent feature allocation matrix $\mathbf{Z}$. As $K \to \infty$, the expected number of nonzero entries in $\mathbf{Z}$ converges to $N\alpha$ [5], where $\alpha$ is a hyperparameter controlling the overall sparsity level of $\mathbf{Z}$.
In this paper, we relate the latent feature models (6) to neural networks with dropout masks. Specifically, the binary random variables $z_{n,k}$ correspond to dropout indicators, and the features $\mathbf{a}_k$ correspond to the inputs or intermediate units of neural networks. From this connection, we can think of a hierarchical Bayesian model in which we place the IBP, or finite-dimensional beta-Bernoulli priors, on the binary dropout indicators. We expect that, due to the IBP's preference for sparse models, the resulting neural network will also be sparse.
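The sparsity induced by the finite beta-Bernoulli construction is easy to verify by simulation. In this sketch (with illustrative values of alpha, K, and N, not taken from the paper), most columns of the sampled binary matrix are entirely zero, and the expected number of active features per row stays near alpha even as K grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite-dimensional beta-Bernoulli prior (illustrative hyperparameters)
alpha, K, N = 5.0, 1000, 200
pi = rng.beta(alpha / K, 1.0, size=K)    # pi_k ~ Beta(alpha/K, 1)
Z = rng.random((N, K)) < pi              # z_{n,k} ~ Bernoulli(pi_k)

# Sparsity: most feature columns are never used, and the average
# number of active features per row is close to alpha
active_per_row = Z.sum(axis=1).mean()
unused_columns = (Z.sum(axis=0) == 0).mean()
```

Interpreted through the dropout-mask correspondence in the text, a column of `Z` plays the role of one neuron's retain indicators across inputs, so an all-zero column is a neuron that can be pruned outright.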
3.3 Dependent Indian Buffet Processes
One important assumption in the IBP is that the features are exchangeable: the distribution is invariant to permutations of the feature assignments. This assumption makes posterior inference convenient, but restricts flexibility when we want to model the dependency of the feature assignments on input covariates $\mathbf{x}_n$, such as time or spatial location.
To this end, Williamson et al. [28] proposed the dependent Indian Buffet process (dIBP), which triggered a line of follow-up work [30, 21]. These models can be summarized as the following generative process:
$\pi_k \sim \mathrm{Beta}(\alpha/K, 1), \qquad z_{n,k} \mid \pi_k, \mathbf{x}_n \sim \mathrm{Ber}\big(g(\pi_k, \mathbf{x}_n)\big)$,  (8)
where $g$ is an arbitrary function that maps $\pi_k$ and $\mathbf{x}_n$ to a probability. In our latent feature interpretation of neural network layers above, the input covariates $\mathbf{x}_n$ correspond to the input or to activations in the previous layer. In other words, we build a data-dependent dropout model in which the dropout rates depend on the inputs. In the main contribution section, we explain in detail how we construct these data-dependent dropout layers.
4 Main contribution
4.1 Variational BetaBernoulli dropout
Inspired by the latent-feature-model interpretation of layers in neural networks, we propose a Bayesian neural network layer overlaid with binary random masks sampled from the finite-dimensional beta-Bernoulli prior. Specifically, let $\mathbf{W}$ be a parameter of a neural network layer, and let $\mathbf{z}_n \in \{0,1\}^K$ be a binary mask vector to be applied to the $n$-th observation $\mathbf{x}_n$. The dimension of $\mathbf{z}_n$ need not equal that of $\mathbf{W}$. Instead, we may enforce arbitrary group sparsity by sharing the binary masks among multiple elements of $\mathbf{W}$. For instance, let $\mathbf{W}$ be a parameter tensor in a convolutional neural network with $C$ channels. To enforce channel-wise sparsity, we introduce $\mathbf{z}_n$ of dimension $C$, and the resulting masked parameter for the $n$-th observation is given as
$\widetilde{\mathbf{W}}^{(n)}_{\cdot,\cdot,c} = z_{n,c}\, \mathbf{W}_{\cdot,\cdot,c}, \qquad c = 1, \dots, C$,  (9)
where $z_{n,c}$ is the $c$-th element of $\mathbf{z}_n$. From now on, with a slight abuse of notation, we denote this binary mask multiplication as
$\widetilde{\mathbf{W}}^{(n)} = \mathbf{z}_n \odot \mathbf{W}$,  (10)
with appropriate sharing of binary mask random variables. The generative process of our Bayesian neural network is then described as
$\pi_k \sim \mathrm{Beta}(\alpha/K, 1), \quad z_{n,k} \mid \pi_k \sim \mathrm{Ber}(\pi_k), \quad \mathbf{W} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}), \quad y_n \mid \mathbf{x}_n \sim p\big(y \mid \mathbf{x}_n;\, \mathbf{z}_n \odot \mathbf{W}\big)$.  (11)
Note the difference between our model and the model in [3]. In [3], only a Gaussian prior is placed on the parameter $\mathbf{W}$, and dropout is applied in the variational distribution to approximate $p(\mathbf{W} \mid \mathcal{D})$. Our model, on the other hand, includes the binary masks in the prior, and the posterior over the binary masks must also be approximated.
The goal of posterior inference is to compute the posterior distribution $p(\mathbf{W}, \mathbf{Z}, \boldsymbol{\pi} \mid \mathcal{D})$, where $\mathbf{Z} = \{\mathbf{z}_n\}_{n=1}^N$ and $\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)$. We approximate this posterior with a variational distribution of the form
$q(\mathbf{W}, \mathbf{Z}, \boldsymbol{\pi}) = \delta_{\widehat{\mathbf{W}}}(\mathbf{W}) \prod_{k=1}^K q(\pi_k; a_k, b_k) \prod_{n=1}^N \prod_{k=1}^K q(z_{n,k} \mid \pi_k)$.  (12)
For $\mathbf{W}$, we conduct a computationally efficient point estimate to get a single value $\widehat{\mathbf{W}}$, with the weight-decay regularization arising from the zero-mean Gaussian prior. For $\boldsymbol{\pi}$, following [19], we use the Kumaraswamy distribution [13] for $q(\pi_k; a_k, b_k)$:
$q(\pi_k; a_k, b_k) = a_k b_k\, \pi_k^{a_k - 1} \big(1 - \pi_k^{a_k}\big)^{b_k - 1}$,  (13)
since it closely resembles the beta distribution and is easily reparametrized as
$\pi_k = \big(1 - u^{1/b_k}\big)^{1/a_k}, \qquad u \sim \mathrm{Unif}(0, 1)$.  (14)
We further assume $q(z_{n,k} \mid \pi_k) = p(z_{n,k} \mid \pi_k) = \mathrm{Ber}(z_{n,k}; \pi_k)$. A sample from $q(z_{n,k} \mid \pi_k)$ can be obtained by reparametrization based on a continuous relaxation [17, 8, 4],
$z_{n,k} = \mathrm{sgm}\Big(\frac{1}{\tau}\Big(\log \frac{\pi_k}{1 - \pi_k} + \log \frac{u}{1 - u}\Big)\Big), \qquad u \sim \mathrm{Unif}(0, 1)$,  (15)
where $\tau$ is the temperature of the continuous relaxation and $\mathrm{sgm}(x) = 1/(1 + e^{-x})$ is the sigmoid function. The KL-divergence between the prior and the variational distribution is then obtained in closed form as follows [19]:
$\mathrm{KL}\big[q(\pi_k; a_k, b_k) \,\|\, \mathrm{Beta}(\pi_k; \alpha/K, 1)\big] = \frac{a_k - \alpha/K}{a_k}\Big(-\gamma - \psi(b_k) - \frac{1}{b_k}\Big) + \log \frac{a_k b_k K}{\alpha} - \frac{b_k - 1}{b_k}$,  (16)
where $\gamma$ is the Euler-Mascheroni constant and $\psi(\cdot)$ is the digamma function.
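The two reparametrizations above (Kumaraswamy sampling for the retain probability, and the concrete relaxation of the Bernoulli mask) can be sketched directly. The variational parameters and temperature below are illustrative, not tuned values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_kumaraswamy(a, b, size):
    """Reparametrized Kumaraswamy sample: pi = (1 - u^(1/b))^(1/a)."""
    u = rng.uniform(1e-8, 1.0 - 1e-8, size=size)
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)

def relaxed_bernoulli(pi, tau):
    """Concrete (binary) relaxation of z ~ Bernoulli(pi) at temperature tau."""
    u = rng.uniform(1e-8, 1.0 - 1e-8, size=pi.shape)
    logits = np.log(pi) - np.log1p(-pi) + np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-logits / tau))

# Illustrative variational parameters for K units
K, tau = 1000, 0.1
pi = sample_kumaraswamy(a=2.0, b=2.0, size=K)   # retain probabilities
z = relaxed_bernoulli(pi, tau)                  # nearly binary mask at low tau
```

Because both samplers are smooth functions of uniform noise and the parameters, gradients can flow through `pi` and `z` during SGVB training; at low temperature the relaxed mask concentrates near 0 and 1.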
Having all of the above ingredients, we can apply the SGVB framework described in Section 3.1 to optimize the variational parameters $\{\widehat{\mathbf{W}}, a_{1:K}, b_{1:K}\}$. After training, the prediction for a novel input $\mathbf{x}^*$ is given as
$p(y^* \mid \mathbf{x}^*, \mathcal{D}) \approx \mathbb{E}_{q(\boldsymbol{\pi})} \mathbb{E}_{p(\mathbf{z} \mid \boldsymbol{\pi})}\big[p(y^* \mid \mathbf{x}^*;\, \mathbf{z} \odot \widehat{\mathbf{W}})\big]$,  (17)
and we found that the following naïve approximation works well in practice,
$p(y^* \mid \mathbf{x}^*, \mathcal{D}) \approx p\big(y^* \mid \mathbf{x}^*;\, \mathbb{E}[\mathbf{z}] \odot \widehat{\mathbf{W}}\big)$,  (18)
where
$\mathbb{E}[z_k] = \mathbb{E}_q[\pi_k] = \dfrac{b_k\, \Gamma(1 + 1/a_k)\, \Gamma(b_k)}{\Gamma(1 + 1/a_k + b_k)}$.  (19)
4.2 Variational Dependent BetaBernoulli Dropout
Now we describe our Bayesian neural network model with an input-dependent beta-Bernoulli dropout prior, constructed as follows:
$\pi_k \sim \mathrm{Beta}(\alpha/K, 1), \quad z_{n,k} \mid \pi_k, \mathbf{x}_n \sim \mathrm{Ber}\big(\pi_k\, \eta_k(\mathbf{x}_n)\big), \quad \mathbf{W} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}), \quad y_n \mid \mathbf{x}_n \sim p\big(y \mid \mathbf{x}_n;\, \mathbf{z}_n \odot \mathbf{W}\big)$.  (20)
Here, $\mathbf{x}_n$ is the input to the dropout layer. For convolutional layers, we apply global average pooling to tensors to get vectorized inputs. In principle, we may generate the input-dependent probabilities with another fully connected layer, with an additional weight matrix and bias as parameters, but this is undesirable for network sparsification. Rather than adding parameters for a fully connected layer, we propose a simple yet effective way to generate input-dependent probabilities with minimal parameters involved. Specifically, we construct each $\eta_k(\mathbf{x}_n)$ independently as follows:
$\eta_k(\mathbf{x}_n) = \min\Big\{1,\, \max\Big\{0,\; \gamma_k\, \dfrac{x_{n,k} - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k\Big\}\Big\}$,  (21)
where $\mu_k$ and $\sigma_k$ are estimates of the $k$-th components of the mean and standard deviation of the inputs, $\gamma_k$ and $\beta_k$ are scaling and shifting parameters to be learned, and $\epsilon$ is a small tolerance to prevent overflow.
The parameterization in (21) is motivated by batch normalization [7]. The intuition behind this construction is as follows. Provided that we have good estimates of $\mu_k$ and $\sigma_k$, the inputs after standardization are approximately distributed as $\mathcal{N}(0, 1)$, so they are centered around zero, with insignificant values close to zero or negative. Hence, if we pass them through the clamping function $\min\{1, \max\{0, \cdot\}\}$, the outputs for the insignificant dimensions are close to zero. However, some inputs may be important for the final classification regardless of their magnitude; in that case, we expect the corresponding shifting parameter $\beta_k$ to be large. Thus through $\boldsymbol{\beta}$ we control the overall sparsity of the dropout layer, but we want its entries to be small unless large values are required, so as to obtain sparse outcomes. We enforce this by placing a zero-mean Gaussian prior distribution on $\boldsymbol{\beta}$.
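The gate construction just described can be sketched as follows. This is a minimal illustration assuming the clamping form in (21); the running statistics and learned parameters below are placeholders, not values from the paper:

```python
import numpy as np

def dependent_gate(x, mu, sigma, gamma, beta, eps=1e-5):
    """Input-dependent retain probability eta_k(x) for each unit k:
    a batch-normalization-style standardization, scaled and shifted,
    then clamped to [0, 1]."""
    standardized = (x - mu) / np.sqrt(sigma ** 2 + eps)
    return np.clip(gamma * standardized + beta, 0.0, 1.0)

x = np.array([0.0, 0.5, 3.0])   # activations from the previous layer
mu = np.zeros(3)                # running mean estimates (illustrative)
sigma = np.ones(3)              # running std estimates (illustrative)
gamma = np.ones(3)              # learned scale
beta = np.zeros(3)              # learned shift, kept small by the prior
eta = dependent_gate(x, mu, sigma, gamma, beta)
# Insignificant (near-zero or negative) inputs get eta near 0,
# large activations saturate at eta = 1.
```

Only the per-unit scale and shift are learned here, which is why this gate adds far fewer parameters than an extra fully connected layer would.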
The goal of variational inference is hence to learn the posterior distribution $p(\mathbf{W}, \mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\beta} \mid \mathcal{D})$, and we approximate this with a variational distribution of the form
$q(\mathbf{W}, \mathbf{Z}, \boldsymbol{\pi}, \boldsymbol{\beta}) = \delta_{\widehat{\mathbf{W}}}(\mathbf{W}) \prod_{k=1}^K q(\pi_k)\, q(\beta_k) \prod_{n=1}^N \prod_{k=1}^K q(z_{n,k} \mid \pi_k, \beta_k, \mathbf{x}_n)$,  (22)
where $q(\pi_k)$ and $\delta_{\widehat{\mathbf{W}}}(\mathbf{W})$ are the same as in beta-Bernoulli dropout, $q(\beta_k)$ is Gaussian, and $q(z_{n,k} \mid \pi_k, \beta_k, \mathbf{x}_n) = p(z_{n,k} \mid \pi_k, \beta_k, \mathbf{x}_n)$. (In principle, we may introduce an inference network for $z_{n,k}$ and minimize the KL-divergence between it and the conditional prior, but this results in a discrepancy between training and testing when sampling $z_{n,k}$, and also makes optimization cumbersome. Hence, we chose to simply set them equal; please refer to [23] for a discussion.) The KL-divergence is computed as
$\mathrm{KL}\big[q(\boldsymbol{\pi}, \boldsymbol{\beta}) \,\|\, p(\boldsymbol{\pi}, \boldsymbol{\beta})\big] = \mathrm{KL}\big[q(\boldsymbol{\pi}) \,\|\, p(\boldsymbol{\pi})\big] + \mathrm{KL}\big[q(\boldsymbol{\beta}) \,\|\, p(\boldsymbol{\beta})\big]$,  (23)
where the first term was described for beta-Bernoulli dropout and the second term can be computed analytically.
The prediction for a novel input $\mathbf{x}^*$ is done similarly as in beta-Bernoulli dropout, with the naïve approximation for the expectation:
$p(y^* \mid \mathbf{x}^*, \mathcal{D}) \approx p\big(y^* \mid \mathbf{x}^*;\, \mathbb{E}[\mathbf{z}^*] \odot \widehat{\mathbf{W}}\big)$,  (24)
where
$\mathbb{E}[z_k^*] \approx \dfrac{b_k\, \Gamma(1 + 1/a_k)\, \Gamma(b_k)}{\Gamma(1 + 1/a_k + b_k)}\; \eta_k(\mathbf{x}^*)$.  (25)
Two-stage pruning scheme
Since $\pi_k\, \eta_k(\mathbf{x}_n) \le \pi_k$ for all $n$ and $k$, we expect the resulting network to be sparser than the network pruned only with beta-Bernoulli dropout (that is, only with $\pi_k$). To achieve this, we propose a two-stage pruning scheme, where we first prune the network with beta-Bernoulli dropout, and then prune the network again with $\pi_k\, \eta_k(\mathbf{x}_n)$ while holding the variational parameters for $\boldsymbol{\pi}$ fixed. By fixing them, the resulting network is guaranteed to be sparser than the network before the second pruning. To save memory, once trained, we pre-prune all units whose $\eta$ values fall below a threshold for every training instance. We found that this saves memory without any accuracy drop in testing.
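The two-stage scheme can be sketched as follows; the threshold and the example probabilities are illustrative, and the functions are hypothetical helpers rather than the paper's implementation:

```python
import numpy as np

def prune_units(pi_mean, threshold=1e-3):
    """Stage 1: input-independent pruning by the expected
    beta-Bernoulli keep probability of each unit."""
    return pi_mean >= threshold

def runtime_mask(pi_mean, eta, keep, threshold=1e-3):
    """Stage 2: input-dependent pruning with pi fixed.  Because
    pi * eta <= pi, the runtime mask is at least as sparse as the
    stage-1 result for every input."""
    return keep & (pi_mean * eta >= threshold)

pi_mean = np.array([0.9, 0.5, 1e-5, 0.2])   # expected keep probabilities
eta = np.array([1.0, 0.001, 1.0, 0.8])      # gates for one particular input
keep = prune_units(pi_mean)                 # unit 2 is pruned for every input
mask = runtime_mask(pi_mean, eta, keep)     # unit 1 is also dropped for this input
```

Stage 1 removes units permanently (saving memory), while stage 2 skips additional units per input at evaluation time (saving computation).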
5 Experiments
We now compare our beta-Bernoulli dropout (BB) and input-dependent beta-Bernoulli dropout (DBB) with other structure learning/pruning algorithms on several neural networks, using benchmark datasets.
Experiment Settings
We followed the common experimental setting for comparing pruning algorithms, using LeNet 500-300, LeNet-5-Caffe (https://github.com/BVLC/caffe/blob/master/examples/mnist), and VGG-like [29] networks on the MNIST [14], CIFAR-10, and CIFAR-100 [12] datasets. We included recent Bayesian pruning methods for a fair comparison: sparse variational dropout (SVD) [18], structured sparsity learning (SSL) [27], and structured Bayesian pruning (SBP) [20]. We faithfully tuned all hyperparameters of the baseline methods on a validation set to find a reasonable solution well balanced between accuracy and sparsification, while fixing the batch size (100) and the maximum number of epochs (300) to match our experimental setting.
Implementation Details
We pretrained all networks using the standard training procedure before fine-tuning for network sparsification [18, 20]. While pruning, we set the learning rate for the weights to be 0.1 times smaller than that for the variational parameters, as in [20]. We used Adam [9] to optimize both BB and DBB. For DBB, as mentioned in Section 4.2, we first prune networks with BB, and then prune again with DBB while holding the variational parameters for $\boldsymbol{\pi}$ fixed.
We report all hyperparameters of BB and DBB for reproducibility. We used the same value of $\alpha/K$ for all layers of BB and DBB. In principle, we could fix $K$ to be a large number and tune $\alpha$; however, in network sparsification tasks $K$ is given as the number of neurons/filters to be pruned, so we instead chose to set the ratio $\alpha/K$ to a small number for all layers. In the testing phase, we pruned the neurons/filters whose expected dropout mask probability was smaller than a fixed threshold (we tried several different threshold values, but the differences were insignificant). For input-dependent dropout, since the number of pruned neurons/filters differs across inputs, we report the running average over the test data. We fixed the temperature $\tau$ of the concrete distribution and the prior variance of $\boldsymbol{\beta}$ for all experiments.
Figure 1 (top): Pruning results on MNIST. Architecture entries list per-layer neuron/filter counts.

| Method | LeNet 500-300 Error (%) | Neurons | Speedup | Memory | LeNet-5-Caffe Error (%) | Neurons/Filters | Speedup | Memory |
|---|---|---|---|---|---|---|---|---|
| Original | 1.69 | 784-500-300 | 1.0x | 100.0% | 0.81 | 20-50-800-500 | 1.0x | 100.0% |
| SVD [18] | 1.43 | 537-59-31 | 16.2x | 6.18% | 0.70 | 11-31-263-27 | 2.9x | 3.74% |
| SSL [27] | 2.25 | 505-68-17 | 15.3x | 6.55% | 1.12 | 3-17-172-63 | 17.1x | 2.86% |
| SBP [20] | 1.70 | 219-102-40 | 20.5x | 4.87% | 0.82 | 13-15-100-50 | 4.5x | 2.40% |
| BB | 1.36 | 264-140-65 | 11.8x | 8.50% | 0.55 | 15-33-138-64 | 2.0x | 5.07% |
| DBB | 1.41 | 100-36-34 | 112.4x | 8.21% | 0.59 | 13-31-53-34 | 2.5x | 4.84% |
[Figure 1 (bottom): left, class-average dropout masks on the input layer of LeNet 500-300 (lenet_dense_cavgs.pdf); right, correlations between class-average dropout masks at each layer of LeNet-5-Caffe (lenet_conv_corr.pdf).]
5.1 Experiments on MNIST dataset
We used the LeNet 500-300 and LeNet-5-Caffe networks on MNIST for comparison. Following convention, we applied dropout to the inputs of the fully connected layers and right after convolution for the convolutional layers. The results are summarized in Fig. 1. For both networks, BB achieved the highest accuracy, significantly improving over the original networks. DBB produced much sparser networks than those pruned with BB with little accuracy drop (see memory usage), and further performs runtime pruning, which significantly reduces the number of neurons to consider, yielding large gains in speedup (as much as 112.4x on LeNet 500-300). Note that for LeNet-5-Caffe, BB and DBB tend to prune fully connected layers rather than convolutional layers. This is because the number of dropout masks applied to the fully connected layers is much larger, which results in a much stronger KL divergence term. We may resolve this by scaling the KL term for the convolutional layers with a constant; in that case we optimize a lower bound on the original ELBO.
On LeNet 500-300, DBB pruned a large number of neurons in the input layer, because the inputs to this network are simply vectorized pixels. DBB adaptively prunes the regions containing no pixel values, while other input-independent pruning algorithms prune general background areas (Fig. 1, bottom left). Notice that, in general, DBB tends to prune fewer neurons/filters in the lower layers of the networks and more in the higher layers (Fig. 1). Further, the dropout masks generated by DBB tend to be generic at lower network layers, extracting common features, but become class-specific at higher layers, specializing features for class discrimination. This phenomenon is clearly shown in Fig. 1 (bottom right), which plots the correlations between the class-average dropout masks at each layer of LeNet-5-Caffe.
5.2 Experiments on CIFAR10 and CIFAR100 datasets
We compared the pruning algorithms on a VGG-like network adapted for the CIFAR-10 and CIFAR-100 datasets. For CIFAR-10, we ran the algorithms for 200 epochs with batch size 100, and for CIFAR-100 we ran them for 300 epochs with batch size 100. Table 1 summarizes the performance of each algorithm, where BB and DBB achieved impressive sparsity with significantly improved accuracies. Notably, the network pruned with DBB showed similar error to the one pruned with BB while using far fewer filters. Further analysis of the filters retained by DBB in Fig. 2 shows that DBB either retains most filters (layer 3) or performs generic pruning (layer 8) at lower layers, while performing diversified pruning at higher layers (layer 15). Moreover, at layer 15, instances from the same class retained similar filters, while instances from different classes retained different filters.
6 Conclusion
We have proposed a novel beta-Bernoulli dropout for network regularization and sparsification, which learns dropout probabilities for each neuron in either an input-independent or input-dependent manner. Beta-Bernoulli dropout learns the distribution of sparse Bernoulli dropout masks for each neuron in a variational inference framework, in contrast to existing work that learns the distribution of Gaussian multiplicative noise or weights, and it obtains significantly more compact networks than those competing approaches. Our dependent beta-Bernoulli dropout, which input-adaptively decides which neurons to drop, further improves on the input-independent version, both in the size of the final network and in runtime computation.
References
 [1] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems 26, 2013.
 [2] J. Feng and T. Darrell. Learning the structure of deep convolutional networks. IEEE International Conference on Computer Vision, 2015.
 [3] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
 [4] Y. Gal, J. Hron, and A. Kendall. Concrete dropout. Advances in Neural Information Processing Systems, 2017.
 [5] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, 2005.
 [6] S. Han, H. Mao, and W. J. Dally. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations, 2016.
 [7] S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
 [8] E. Jang, S. Gu, and B. Poole. Categorical reparametrization with Gumbelsoftmax. In Proceedings of the International Conference on Learning Representations, 2017.
 [9] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
 [10] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparametrization trick. In Advances in Neural Information Processing Systems 28, 2015.
 [11] D. P. Kingma and M. Welling. Autoencoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.
 [12] A. Krizhevsky and G. E. Hinton. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.
 [13] P. Kumaraswamy. A generalized probability density function for doublebounded random processes. Journal of Hydrology, 1980.
 [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [15] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. Advances in Neural Information Processing Systems, 2017.
 [16] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through regularization. International Conference on Learning Representations, 2018.
 [17] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: a continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations, 2017.
 [18] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 [19] E. Nalisnick and P. Smyth. Stickbreaking variational autoencoders. In Proceedings of the International Conference on Learning Representations, 2017.
 [20] K. Neklyudov, D. Molchanov, A. Ashukha, and D. Vetrov. Structured Bayesian pruning via lognormal multiplicative noise. Advances in Neural Information Processing Systems, 2017.
 [21] L. Ren, Y. Wang, D. B. Dunson, and L. Carin. The kernel beta process. In Advances in Neural Information Processing Systems 24, 2011.
 [22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.
 [23] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28, 2015.
 [24] S. Srinivas and R. V. Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
 [25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [26] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2007.
 [27] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29, 2016.
 [28] S. Williamson, P. Orbanz, and Z. Ghahramani. Dependent Indian buffet processes. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
 [29] S. Zagoruyko. 92.45% on CIFAR-10 in Torch. Technical report, 2015.
 [30] M. Zhou, H. Yang, G. Sapiro, and D. B. Dunson. Dependent hierarchical beta process for image interpolation and denoising. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.